As a PhD student in the UK system, I expected far less coursework, with my first year diving straight into research. However, there are still plenty of gaps in my knowledge, so I want to stay on the lookout for learning opportunities, including side classes.
At the moment, I’m hoping to follow along with these three courses and do some assignments from time to time:
All materials from these classes are online at the above links, so I would recommend checking them out!
In this post, I’ll work through an example of polynomial regression while illustrating some R features like ggplot and R Markdown documents. The data that I’ll use comes from the CS 281 course linked above, in problem 4 of assignment 2:
Here are some simple data to regress:
x = [-1.87 -1.76 -1.67 -1.22 -0.07 0.11 0.67 1.60 2.22 2.51]'
y = [0.06 1.67 0.54 -1.45 -0.18 -0.67 0.92 2.95 5.13 5.18]'
Construct a Bayesian linear regression model using a basis of your choosing (e.g., polynomial, sinusoids, radial basis functions). Choose priors that seem sensible for the regression weights and the Gaussian noise.
While taking the course, I didn’t manage to fully solve the problem. For now, I’ll be using the data points to demonstrate not Bayesian regression, but the simpler case of least-squares regression without a prior.
To start, we can create variables for our data points and plot them using ggplot.
library(ggplot2)
# Reusable theme element to center plot titles
center_title = theme(plot.title = element_text(hjust = 0.5))
x = c(-1.87, -1.76, -1.67, -1.22, -0.07, 0.11, 0.67, 1.60, 2.22, 2.51)
y = c(0.06, 1.67, 0.54, -1.45, -0.18, -0.67, 0.92, 2.95, 5.13, 5.18)
ggplot(data.frame(x = x, y = y), aes(x=x, y=y)) + geom_point() +
  labs(title = "Points to regress") + center_title
Linear least-squares regression seeks to learn the parameters \(\beta_0, \beta_1\) that minimize \[
\sum_i (y_i-(\beta_0 + \beta_1 x_i))^2
\] If we introduce the matrices \[
\mathbf{y} = \begin{pmatrix} y_1\\ y_2\\ \vdots \\ y_n \end{pmatrix},
\mathbf{X} = \begin{pmatrix}
1 & x_1\\
1 & x_2\\
\vdots & \vdots\\
1 & x_n\\
\end{pmatrix},
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}
\] then the loss function can be written as \[
l(\boldsymbol{\beta}; \mathbf{X}, \mathbf{y}) = (\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^\intercal (\mathbf{y}-\mathbf{X}\boldsymbol{\beta})
\] Without going through all the algebra: setting the derivative with respect to \(\boldsymbol{\beta}\) to zero and applying a few matrix identities gives the normal equations \(\mathbf{X}^\intercal\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^\intercal\mathbf{y}\), whose solution is \[
\mathbf{\hat{\boldsymbol{\beta}}} = (\mathbf{X}^\intercal\mathbf{X})^{-1} \mathbf{X}^\intercal \mathbf{y}
\] In R, the built-in function lm can solve linear regression for us, but here we’ll code up the solution manually instead. First, we set up our matrices \(\mathbf{X}\) and \(\mathbf{y}\):
x_matrix = cbind(rep(1, 10), x)
x_matrix
## x
## [1,] 1 -1.87
## [2,] 1 -1.76
## [3,] 1 -1.67
## [4,] 1 -1.22
## [5,] 1 -0.07
## [6,] 1 0.11
## [7,] 1 0.67
## [8,] 1 1.60
## [9,] 1 2.22
## [10,] 1 2.51
y_matrix = as.matrix(y)
y_matrix
## [,1]
## [1,] 0.06
## [2,] 1.67
## [3,] 0.54
## [4,] -1.45
## [5,] -0.18
## [6,] -0.67
## [7,] 0.92
## [8,] 2.95
## [9,] 5.13
## [10,] 5.18
Next we write the matrix algebra solution for computing \(\mathbf{\hat{\boldsymbol{\beta}}}\). If you’re new to R, the notation may be unfamiliar to you, and you may want to consult this cheatsheet.1
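If the operators in the next line are new to you, here is a tiny throwaway example (the matrix M and its values are arbitrary, purely for illustration): t() transposes, %*% multiplies matrices, and solve() inverts.
M = matrix(c(2, 0, 1, 3), nrow = 2)  # an arbitrary 2x2 matrix
t(M)                                 # transpose
M %*% c(1, 1)                        # matrix-vector multiplication
solve(M)                             # matrix inverse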
beta = solve(t(x_matrix) %*% x_matrix) %*% t(x_matrix) %*% y_matrix
beta
## [,1]
## 1.359589
## x 1.065601
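As a quick sanity check (not part of the manual derivation; fit is just my name for the model object), base R’s lm should recover the same intercept and slope:
fit = lm(y ~ x)  # ordinary least-squares fit of y on x
coef(fit)        # should match the entries of beta above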
Finally, we plot the result.
ggplot(data.frame(x = x, y = y), aes(x=x, y=y)) + geom_point() +
  labs(title = "Linear regression") + center_title +
  geom_abline(slope = beta[2], intercept = beta[1])
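As an aside, ggplot can also draw the least-squares line itself with geom_smooth using method = "lm"; the sketch below should produce the same line as the geom_abline version above (the plot title is mine).
ggplot(data.frame(x = x, y = y), aes(x=x, y=y)) + geom_point() +
  labs(title = "Linear regression via geom_smooth") + center_title +
  geom_smooth(method = "lm", se = FALSE)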
What if instead of fitting a line, we wanted to fit a quadratic to our data? Our quadratic would then have three parameters, and we would want to minimize: \[
\sum_i (y_i-(\beta_0 + \beta_1 x_i + \beta_2 x_i^2))^2
\] This loss function \(l(\boldsymbol{\beta}; \mathbf{X}, \mathbf{y})\) can be put in the exact same form as before, as long as we instead define our matrices as:2 \[
\mathbf{y} = \begin{pmatrix} y_1\\ y_2\\ \vdots \\ y_n \end{pmatrix},
\mathbf{X} = \begin{pmatrix}
1 & x_1 & x_1^2\\
1 & x_2 & x_2^2\\
\vdots & \vdots & \vdots \\
1 & x_n & x_n^2\\
\end{pmatrix},
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix}
\] We then get: \[\begin{gather*}
l(\boldsymbol{\beta}; \mathbf{X}, \mathbf{y}) = (\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^\intercal (\mathbf{y}-\mathbf{X}\boldsymbol{\beta})\\
\mathbf{\hat{\boldsymbol{\beta}}} = (\mathbf{X}^\intercal\mathbf{X})^{-1} \mathbf{X}^\intercal \mathbf{y}
\end{gather*}\]
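As a concrete instance, here is a small sketch of the quadratic case using the same closed-form solution (the names x_quad and beta_quad are mine); the coefficients should agree with what lm(y ~ poly(x, 2, raw = TRUE)) reports.
x_quad = cbind(1, x, x^2)  # design matrix with columns 1, x, x^2
beta_quad = solve(t(x_quad) %*% x_quad) %*% t(x_quad) %*% as.matrix(y)
beta_quad                  # should match coef(lm(y ~ poly(x, 2, raw = TRUE)))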
It should be clear that this pattern continues on, at least until the degree \(d\) of the polynomial equals \(n - 1\). At that point, we are able to hit all our \(n\) points exactly (assuming the \(x\)-coordinates are all different), since the polynomial has \(n\) free parameters. If we then take \(d \geq n\), there are multiple polynomial solutions, and we should no longer trust our expression to give the single right solution.3
Using the above result, we can perform polynomial regression on our \(n = 10\) data points with \(d = 0\) to \(9\). Once we have the \(\boldsymbol{\beta}\) coefficients, we sweep over an \(x\)-range from -2.5 to 3 and generate the corresponding \(y\) value for the polynomial. We keep the resulting polynomial data in a data frame.
x_matrix = NULL
y_matrix = as.matrix(y)
x_curve = seq(-2.5, 3, 0.05)
y_data = NULL
for(i in 1:10) {
  # Append the next column x^(i-1), giving the design matrix for degree d = i - 1
  x_matrix = cbind(x_matrix, x^(i-1))
  beta = solve(t(x_matrix) %*% x_matrix) %*% t(x_matrix) %*% y_matrix
  # Evaluate the fitted degree-(i-1) polynomial along x_curve
  y_curve = rep(0, length(x_curve))
  for(j in 1:i) {
    y_curve = y_curve + beta[j]*x_curve^(j-1)
  }
  y_data = rbind(y_data, data.frame(x=x_curve, y=y_curve, d=(i-1)))
}
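Before looking at the results, a quick sanity check on the bookkeeping: x_curve has 111 points, so y_data should contain 111 rows for each of the 10 degrees.
dim(y_data)      # should report 1110 rows and 3 columns (x, y, d)
table(y_data$d)  # should show 111 rows for each degree 0 through 9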
Let’s take a look at our final \(\mathbf{X}\) matrix, for \(d=9\).
round(x_matrix, digits=2)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 -1.87 3.50 -6.54 12.23 -22.87 42.76 -79.96 149.53 -279.62
## [2,] 1 -1.76 3.10 -5.45 9.60 -16.89 29.72 -52.31 92.07 -162.04
## [3,] 1 -1.67 2.79 -4.66 7.78 -12.99 21.69 -36.23 60.50 -101.03
## [4,] 1 -1.22 1.49 -1.82 2.22 -2.70 3.30 -4.02 4.91 -5.99
## [5,] 1 -0.07 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## [6,] 1 0.11 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## [7,] 1 0.67 0.45 0.30 0.20 0.14 0.09 0.06 0.04 0.03
## [8,] 1 1.60 2.56 4.10 6.55 10.49 16.78 26.84 42.95 68.72
## [9,] 1 2.22 4.93 10.94 24.29 53.92 119.71 265.75 589.96 1309.71
## [10,] 1 2.51 6.30 15.81 39.69 99.63 250.06 627.65 1575.40 3954.24
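Since the last loop iteration has \(d = 9 = n - 1\), the fitted polynomial should pass through every data point. As a rough check, reusing the x_matrix, beta, and y_matrix left over from the loop, the residuals below should come out small relative to the \(y\) values, though not exactly zero, since the degree-9 system is badly conditioned (see the footnote on \(d \geq n\)).
x_matrix %*% beta - y_matrix  # residuals of the degree-9 fit; ideally zero, in practice only approximately so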
Here is what some of our output data frame looks like:
head(y_data[y_data$d == 2,])
## x y d
## 223 -2.50 1.970514 2
## 224 -2.45 1.843023 2
## 225 -2.40 1.718879 2
## 226 -2.35 1.598079 2
## 227 -2.30 1.480625 2
## 228 -2.25 1.366517 2
And now, we plot the results. Plotting all the polynomials at once makes the graph too cluttered, so instead we only show two plots that help illustrate different behavior.
ggplot() +
  geom_point(data=data.frame(x = x, y = y), aes(x=x, y=y)) +
  geom_line(data=y_data[y_data$d < 4,],
            aes(x=x, y=y, color=factor(d), group=factor(d))) +
  labs(title = "Polynomial least-squares regression", color="dimension") +
  center_title

ggplot() +
  geom_point(data=data.frame(x = x, y = y), aes(x=x, y=y)) +
  geom_line(data=y_data[y_data$d %% 2 == 1,],
            aes(x=x, y=y, color=factor(d), group=factor(d))) +
  labs(title = "Polynomial least-squares regression", color="dimension") +
  center_title +
  coord_cartesian(ylim = c(-3, 8))
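Incidentally, coord_cartesian(ylim = c(-3, 8)) only zooms the view. Using ylim(-3, 8) instead, as sketched below, would first drop every row outside that range, so the parts of the high-degree curves beyond the limits disappear rather than being clipped at the plot edge, and ggplot warns about removed rows.
ggplot() +
  geom_point(data=data.frame(x = x, y = y), aes(x=x, y=y)) +
  geom_line(data=y_data[y_data$d %% 2 == 1,],
            aes(x=x, y=y, color=factor(d), group=factor(d))) +
  labs(title = "Polynomial least-squares regression", color="dimension") +
  center_title +
  ylim(-3, 8)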
Hopefully, I’ve impressed upon you that ggplots are easy to make in R and look nice!

In a future post, I may circle back to this data and solve more of the original CS 281 assignment involving Bayesian regression.
Thanks to the ggplot starter code that helped me when I was first learning, and to the discussion of setting ylim directly versus inside coord_cartesian when using ggplot.

This blog post was generated from an R Markdown file using the knitr and blogdown packages. The original source can be downloaded from GitHub.
UPDATE 2017-11-21: added a reference to the Vandermonde matrix and made some other small fixes. UPDATE 2020-02-04: added a link showing that \(\text{rank}(\mathbf{A}^\intercal\mathbf{A}) = \text{rank}(\mathbf{A})\).
Another note for new R users: you can pull up the documentation for a command like solve by executing ?solve in the console.
Thanks to Shoucheng Zhang for pointing out that the general \(\mathbf{X}\) matrix is known as the Vandermonde matrix.
Interestingly, I tried this out on our data using \(d=n=10\). The math expression is \[ \mathbf{\hat{\boldsymbol{\beta}}} = (\mathbf{X}^\intercal\mathbf{X})^{-1} \mathbf{X}^\intercal \mathbf{y}, \] and when I get to the inversion step, I receive an error message:
Error in solve.default(t(x_matrix) %*% x_matrix) :
system is computationally singular: reciprocal condition number = 1.10547e-20
This suggests that for \(d \geq n\), the matrix \((\mathbf{X}^\intercal\mathbf{X})\) is no longer full-rank. Indeed, looking online, I seem to find that for any real matrix \(\mathbf{A}\), \(\text{rank}(\mathbf{A}^\intercal\mathbf{A}) = \text{rank}(\mathbf{A}) = \text{rank}(\mathbf{A}^\intercal)\). In this case, where \(\mathbf{X}\) is \(n\) by \((d+1)\), we have \(\text{rank}(\mathbf{X}^\intercal\mathbf{X}) = \text{rank}(\mathbf{X}^\intercal) \leq n\), the number of columns of \(\mathbf{X}^\intercal\), but \(\mathbf{X}^\intercal\mathbf{X}\) is a \((d+1)\) by \((d+1)\) matrix, so it cannot be full-rank.
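A tiny self-contained illustration of that rank fact (the matrix A is arbitrary and rank-deficient by construction):
A = matrix(c(1, 2, 2, 4), nrow = 2)  # rank 1: the second column is twice the first
qr(A)$rank                           # 1
qr(t(A) %*% A)$rank                  # also 1, matching rank(A)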
For a similar presentation in Python, see this blog post by Jake VanderPlas.