Brian Zhang's blog

Statistics and other topics

Polynomial Regression

Posted Nov 9, 2017 · 8 min read

Introduction: side courses

As a PhD student in the UK system, I was expecting a lot less coursework, with my first year diving straight into research. However, there are still plenty of gaps in my knowledge, so I hope to stay on the lookout for learning opportunities, including side classes.

At the moment, I’m hoping to follow along with these three courses and do some assignments from time to time:

  • With a few other Oxford students, I’m keeping up with the machine learning course at UCL’s Gatsby Computational Neuroscience Unit, taught by Maneesh Sahani. I’m hoping to refresh the statistical foundations of machine learning, and particularly learn approximate inference. A previous year’s course materials are here.
  • My PhD is in statistical genetics, but my background is primarily in machine learning, so I’ve been working through Barbara Engelhardt’s Spring 2013 Duke course, STA613/CBB540: Statistical methods in computational biology. The first assignment was also a really smooth introduction to learning R.
  • While at Harvard, I took two semesters of machine learning with Ryan Adams. The second semester was a graduate class, Fall 2013 of CS 281: Advanced Machine Learning, and was a lot more difficult. I didn’t get around to finishing all the assignments, so I’ve been meaning to go back and solve some of the problems, including Assignment 5 on Gaussian Processes and Bayesian Nonparametrics, which I skipped entirely since we could drop our lowest grade.

All materials from these classes are online at the above links, so I would recommend checking them out!

Data from a CS 281 assignment

In this post, I’ll work through an example of polynomial regression, while illustrating some R features like ggplot and R Markdown documents. The data that I’ll use comes from the CS 281 course linked to above, in problem 4 of assignment 2:

Here are some simple data to regress:

x = [-1.87 -1.76 -1.67 -1.22 -0.07 0.11 0.67 1.60 2.22 2.51]'
y = [0.06 1.67 0.54 -1.45 -0.18 -0.67 0.92 2.95 5.13 5.18]'

Construct a Bayesian linear regression model using a basis of your choosing (e.g., polynomial, sinusoids, radial basis functions). Choose priors that seem sensible for the regression weights and the Gaussian noise.

While taking the course, I didn’t manage to fully solve the problem. However, for now, I’ll be using the data points to demonstrate not Bayesian regression, but the simpler case of least-squares regression without a prior.

To start, we can create variables for our data points and plot them using ggplot.

library(ggplot2)
center_title = theme(plot.title = element_text(hjust = 0.5))
x = c(-1.87, -1.76, -1.67, -1.22, -0.07, 0.11, 0.67, 1.60, 2.22, 2.51)
y = c(0.06, 1.67, 0.54, -1.45, -0.18, -0.67, 0.92, 2.95, 5.13, 5.18)
ggplot(data.frame(x = x, y = y), aes(x=x, y=y)) + geom_point() +
  labs(title = "Points to regress") + center_title

A simple linear regression

Linear least-squares regression seeks to learn the parameters \(\beta_0, \beta_1\) that minimize \[ \sum_i (y_i-(\beta_0 + \beta_1 x_i))^2. \]

If we introduce the matrices \[ \mathbf{y} = \begin{pmatrix} y_1\\ y_2\\ \vdots \\ y_n \end{pmatrix}, \quad \mathbf{X} = \begin{pmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n\\ \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}, \] the loss function can be written as \[ l(\boldsymbol{\beta}; \mathbf{X}, \mathbf{y}) = (\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^\intercal (\mathbf{y}-\mathbf{X}\boldsymbol{\beta}). \]

Without going through all the algebra, the minimum with respect to \(\boldsymbol{\beta}\) is found by setting the gradient \(-2\mathbf{X}^\intercal(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})\) to zero, which gives the normal equations \(\mathbf{X}^\intercal\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^\intercal\mathbf{y}\) and hence the solution \[ \mathbf{\hat{\boldsymbol{\beta}}} = (\mathbf{X}^\intercal\mathbf{X})^{-1} \mathbf{X}^\intercal \mathbf{y}. \]

In R, the function lm can solve linear regression for us, but here we can also code up the solution manually. First, we set up our matrices \(\mathbf{X}\) and \(\mathbf{y}\):

x_matrix = cbind(rep(1, 10), x)
x_matrix
##             x
##  [1,] 1 -1.87
##  [2,] 1 -1.76
##  [3,] 1 -1.67
##  [4,] 1 -1.22
##  [5,] 1 -0.07
##  [6,] 1  0.11
##  [7,] 1  0.67
##  [8,] 1  1.60
##  [9,] 1  2.22
## [10,] 1  2.51
y_matrix = as.matrix(y)
y_matrix
##        [,1]
##  [1,]  0.06
##  [2,]  1.67
##  [3,]  0.54
##  [4,] -1.45
##  [5,] -0.18
##  [6,] -0.67
##  [7,]  0.92
##  [8,]  2.95
##  [9,]  5.13
## [10,]  5.18

Next, we write the matrix-algebra solution for computing \(\mathbf{\hat{\boldsymbol{\beta}}}\). If you’re new to R, the notation may be unfamiliar, in which case you may want to consult this cheatsheet.1

# t() is the transpose, %*% is matrix multiplication, and solve(A) inverts A
beta = solve(t(x_matrix) %*% x_matrix) %*% t(x_matrix) %*% y_matrix
beta
##       [,1]
##   1.359589
## x 1.065601
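
As a quick sanity check, the built-in lm function should recover the same intercept and slope:

lm_fit = lm(y ~ x)   # ordinary least squares with R's built-in routine
coef(lm_fit)         # should agree with beta above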

Finally, we plot the result.

ggplot(data.frame(x = x, y = y), aes(x=x, y=y)) + geom_point() +
  labs(title = "Linear regression") + center_title +
  geom_abline(slope = beta[2], intercept = beta[1])

Generalizing to polynomial regression

What if instead of fitting a line, we wanted to fit a quadratic to our data? Our quadratic would then have three parameters, and we would want to minimize \[ \sum_i (y_i-(\beta_0 + \beta_1 x_i + \beta_2 x_i^2))^2. \]

This loss function \(l(\boldsymbol{\beta}; \mathbf{X}, \mathbf{y})\) can be put in the exact same form as before, as long as we instead define our matrices as:2 \[ \mathbf{y} = \begin{pmatrix} y_1\\ y_2\\ \vdots \\ y_n \end{pmatrix}, \quad \mathbf{X} = \begin{pmatrix} 1 & x_1 & x_1^2\\ 1 & x_2 & x_2^2\\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2\\ \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix} \]

We then get: \[\begin{gather*} l(\boldsymbol{\beta}; \mathbf{X}, \mathbf{y}) = (\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^\intercal (\mathbf{y}-\mathbf{X}\boldsymbol{\beta})\\ \mathbf{\hat{\boldsymbol{\beta}}} = (\mathbf{X}^\intercal\mathbf{X})^{-1} \mathbf{X}^\intercal \mathbf{y} \end{gather*}\]
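
As a concrete sketch of the quadratic case, we can build this \(\mathbf{X}\) by hand (reusing the y_matrix defined above) and check it against lm with a raw polynomial basis; the two fits should agree:

x_quad = cbind(1, x, x^2)   # basis columns 1, x, x^2
beta_quad = solve(t(x_quad) %*% x_quad) %*% t(x_quad) %*% y_matrix
beta_quad
coef(lm(y ~ poly(x, 2, raw = TRUE)))   # same coefficients, up to naming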

It should be clear that this pattern continues on, at least until the degree \(d\) of the polynomial equals \(n - 1\). At that point, we are able to hit all our \(n\) points exactly (assuming the \(x\)-coordinates are all different), since the polynomial has \(n\) free parameters. If we then take \(d \geq n\), there are multiple polynomial solutions, and we should no longer trust our expression to give the single right solution.3
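
One way to see the interpolation claim: when \(d = n - 1\) and the \(x_i\) are distinct, \(\mathbf{X}\) is a square, invertible matrix, so \[ \mathbf{\hat{\boldsymbol{\beta}}} = (\mathbf{X}^\intercal\mathbf{X})^{-1} \mathbf{X}^\intercal \mathbf{y} = \mathbf{X}^{-1}(\mathbf{X}^\intercal)^{-1}\mathbf{X}^\intercal \mathbf{y} = \mathbf{X}^{-1}\mathbf{y}, \] and therefore \(\mathbf{X}\mathbf{\hat{\boldsymbol{\beta}}} = \mathbf{y}\): the fitted polynomial passes through every data point.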

Using the above result, we can perform polynomial regression on our \(n = 10\) data points for degrees \(d = 0\) through \(9\). Once we have the \(\boldsymbol{\beta}\) coefficients, we sweep over an \(x\)-range from -2.5 to 3 and generate the corresponding \(y\) values for each polynomial. We keep the resulting curves in a single data frame.

x_matrix = NULL
y_matrix = as.matrix(y)
x_curve = seq(-2.5, 3, 0.05)   # x-grid for plotting the fitted polynomials
y_data = NULL
for(i in 1:10) {
  # Degree d = i - 1: append the column x^(i-1) to the design matrix
  x_matrix = cbind(x_matrix, x^(i-1))
  # Least-squares solution via the normal equations
  beta = solve(t(x_matrix) %*% x_matrix) %*% t(x_matrix) %*% y_matrix
  # Evaluate the fitted polynomial along x_curve
  y_curve = rep(0, length(x_curve))
  for(j in 1:i) {
    y_curve = y_curve + beta[j]*x_curve^(j-1)
  }
  y_data = rbind(y_data, data.frame(x=x_curve, y=y_curve, d=(i-1)))
}
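
After the loop, x_matrix and beta hold the degree-9 fit. With 10 distinct \(x\) values this polynomial interpolates the data exactly in exact arithmetic; in floating point the residuals are only approximately zero, since \(\mathbf{X}^\intercal\mathbf{X}\) becomes very badly conditioned at high degree. A quick check:

max(abs(y_matrix - x_matrix %*% beta))   # exactly zero in exact arithmetic; only approximately so here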

Let’s take a look at the final \(\mathbf{X}\) matrix, for \(d=9\).

round(x_matrix, digits=2)
##       [,1]  [,2] [,3]  [,4]  [,5]   [,6]   [,7]   [,8]    [,9]   [,10]
##  [1,]    1 -1.87 3.50 -6.54 12.23 -22.87  42.76 -79.96  149.53 -279.62
##  [2,]    1 -1.76 3.10 -5.45  9.60 -16.89  29.72 -52.31   92.07 -162.04
##  [3,]    1 -1.67 2.79 -4.66  7.78 -12.99  21.69 -36.23   60.50 -101.03
##  [4,]    1 -1.22 1.49 -1.82  2.22  -2.70   3.30  -4.02    4.91   -5.99
##  [5,]    1 -0.07 0.00  0.00  0.00   0.00   0.00   0.00    0.00    0.00
##  [6,]    1  0.11 0.01  0.00  0.00   0.00   0.00   0.00    0.00    0.00
##  [7,]    1  0.67 0.45  0.30  0.20   0.14   0.09   0.06    0.04    0.03
##  [8,]    1  1.60 2.56  4.10  6.55  10.49  16.78  26.84   42.95   68.72
##  [9,]    1  2.22 4.93 10.94 24.29  53.92 119.71 265.75  589.96 1309.71
## [10,]    1  2.51 6.30 15.81 39.69  99.63 250.06 627.65 1575.40 3954.24

Here is what some of our output data frame looks like:

head(y_data[y_data$d == 2,])
##         x        y d
## 223 -2.50 1.970514 2
## 224 -2.45 1.843023 2
## 225 -2.40 1.718879 2
## 226 -2.35 1.598079 2
## 227 -2.30 1.480625 2
## 228 -2.25 1.366517 2

And now, we plot the results. Plotting all the polynomials at once makes the graph too cluttered, so instead we show only two plots, each illustrating a different behavior.

ggplot() + 
  geom_point(data=data.frame(x = x, y = y), aes(x=x, y=y)) +
  geom_line(data=y_data[y_data$d < 4,],
            aes(x=x, y=y, color=factor(d), group=factor(d))) +
  labs(title = "Polynomial least-squares regression", color="dimension") +
  center_title

ggplot() + 
  geom_point(data=data.frame(x = x, y = y), aes(x=x, y=y)) +
  geom_line(data=y_data[y_data$d %% 2 == 1,],
            aes(x=x, y=y, color=factor(d), group=factor(d))) +
  labs(title = "Polynomial least-squares regression", color="dimension") +
  center_title +
  coord_cartesian(ylim = c(-3, 8))
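
As an aside (see the acknowledgements below), this last plot restricts the \(y\)-axis with coord_cartesian rather than ylim. The two are not interchangeable in ggplot: ylim (equivalently, scale_y_continuous with limits) replaces values outside the limits with NA before drawing, which would leave gaps in the steeper curves, while coord_cartesian simply zooms the view without discarding data. One way to see the difference:

ggplot() + 
  geom_point(data=data.frame(x = x, y = y), aes(x=x, y=y)) +
  geom_line(data=y_data[y_data$d %% 2 == 1,],
            aes(x=x, y=y, color=factor(d), group=factor(d))) +
  ylim(-3, 8)   # drops out-of-range points, unlike coord_cartesian above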

Summary

Hopefully, I’ve impressed upon you that:

  • Polynomial least-squares regression is simply multiple linear regression with an expanded basis.
  • ggplot figures are easy to make in R and look nice!

In a future post, I may circle back to this data and solve more of the original CS 281 assignment involving Bayesian regression.

Thanks to:

  • Ryan Adams for teaching CS 281 and making the course materials freely available.
  • William Chen for some ggplot starter code that helped me when I was first learning.
  • Constantin Ahlmann-Eltze for calling my attention to the difference between setting ylim directly and using coord_cartesian in ggplot.

This blog post was generated from an R Markdown file using the knitr and blogdown packages. The original source can be downloaded from GitHub.

UPDATE 2017-11-21: added a reference to the Vandermonde matrix and made some other small fixes. UPDATE 2020-02-04: added a link showing that \(\text{rank}(\mathbf{A}^\intercal\mathbf{A}) = \text{rank}(\mathbf{A})\).


  1. Another note for new R users: you can pull up the documentation for a command like solve by executing ?solve in the console.

  2. Thanks to Shoucheng Zhang for pointing out that the general \(\mathbf{X}\) matrix is known as the Vandermonde matrix.

  3. Interestingly, I tried this out on our data using \(d=n=10\). The math expression is \[ \mathbf{\hat{\boldsymbol{\beta}}} = (\mathbf{X}^\intercal\mathbf{X})^{-1} \mathbf{X}^\intercal \mathbf{y}, \] and when I get to the inversion step, I receive an error message:

    Error in solve.default(t(x_matrix) %*% x_matrix) : 
    system is computationally singular: reciprocal condition number = 1.10547e-20

    This suggests that for \(d \geq n\), the matrix \((\mathbf{X}^\intercal\mathbf{X})\) is no longer full-rank. Indeed, looking online, I seem to find that for any real matrix \(\mathbf{A}\), \(\text{rank}(\mathbf{A}^\intercal\mathbf{A}) = \text{rank}(\mathbf{A}) = \text{rank}(\mathbf{A}^\intercal)\). In this case, where \(\mathbf{X}\) is \(n\) by \((d+1)\), we have \(\text{rank}(\mathbf{X}^\intercal\mathbf{X}) = \text{rank}(\mathbf{X}^\intercal) \leq n\), the number of columns of \(\mathbf{X}^\intercal\), but \(\mathbf{X}^\intercal\mathbf{X}\) is a \((d+1)\) by \((d+1)\) matrix with \(d + 1 > n\), so it cannot be full-rank.
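
    As a quick numerical check, base R’s qr function reports the rank of the \(d = 10\) design matrix directly, and that rank is capped by the number of rows:

    x_matrix_11 = cbind(x_matrix, x^10)  # the d = 10 design matrix: 11 columns, only 10 rows
    qr(x_matrix_11)$rank                 # at most 10, so t(X) %*% X cannot be full-rank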

    For a similar presentation in Python, see this blog post by Jake VanderPlas.
