Given a set of data, linear regression is a way to model the data with a linear function. This can be done by minimizing a certain "cost function", which measures the deviation of the data from the hypothesized linear model. The linear model with the least cost then gives the "best fit" for the data.
Consider a set of \(n\) 2-dimensional data points \[ \{(x_1, y_1), \dots, (x_n, y_n)\} \subset \mathbb{R}^2. \] A linear model for this data will be a line of the form \(y = mx + b\). We hope to find the values of slope \(m\) and intercept \(b\) that minimize the cost function, which we take as half the average of the squared vertical distances from the data points to the line (the extra factor of \(\tfrac{1}{2}\) simplifies the derivatives below): \[ J(m, b) = \frac{1}{2n} \sum_{i=1}^n (m x_i + b - y_i)^2. \]
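For concreteness, here is a minimal sketch of this cost function in Python; the names `cost`, `xs`, and `ys` are purely illustrative, and NumPy is assumed for the arithmetic.

```python
import numpy as np

def cost(m, b, xs, ys):
    """J(m, b): half the average squared vertical distance from the
    points (xs[i], ys[i]) to the line y = m*x + b."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    residuals = m * xs + b - ys            # signed vertical distances
    return np.mean(residuals ** 2) / 2.0   # (1/2n) * sum of squared residuals
```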
Note that \(J\) is a function of \(m\) and \(b\). How do we find the values of \(m\) and \(b\) that minimize \(J(m, b)\)? This is a multivariable optimization problem, and one method is to use gradient descent. The idea is that the negative gradient \(-\nabla J\) is a vector that points in the direction of steepest decrease of \(J\). So, if we incrementally follow the negative gradient through the \((m, b)\)-plane, we will eventually reach a point \((m', b')\) that (approximately) minimizes \(J\).
In general, a minimizer found by gradient descent is only guaranteed to be a local one, but the cost function above is convex, so any local minimizer it finds is automatically a global minimizer. Here is the algorithm; again, think of the variables \(m\) and \(b\) as a point \((m, b)\) in the plane.
The gradient is the vector of partial derivatives of \(J\): \[ \nabla J(m, b) = \left( \frac{\partial J}{\partial m}(m, b), \frac{\partial J}{\partial b}(m, b) \right) \] where \[ \frac{\partial J}{\partial m}(m, b) = \frac{1}{n} \sum_{i=1}^n (m x_i + b - y_i) x_i \] \[ \frac{\partial J}{\partial b}(m, b) = \frac{1}{n} \sum_{i=1}^n (m x_i + b - y_i). \] Step 1: choose an initial point \((m, b)\) and a fixed number \(\alpha > 0\). Step 2: replace \((m, b)\) with \((m, b) - \alpha \nabla J(m, b)\). Step 3: repeat Step 2 until the point stops changing appreciably.
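As a sketch of how the whole algorithm might look in Python: the function names, the starting point \((0, 0)\), and the default values of `alpha` and `steps` are assumptions chosen for illustration, not taken from the app below.

```python
import numpy as np

def gradient(m, b, xs, ys):
    """Partial derivatives of J(m, b) with respect to m and b."""
    residuals = m * xs + b - ys
    dJ_dm = np.mean(residuals * xs)   # (1/n) * sum of (m*x_i + b - y_i) * x_i
    dJ_db = np.mean(residuals)        # (1/n) * sum of (m*x_i + b - y_i)
    return dJ_dm, dJ_db

def gradient_descent(xs, ys, alpha=0.01, steps=10_000):
    """Steps 1-3 above; alpha is the step size (the "learning rate")."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    m, b = 0.0, 0.0                       # Step 1: initial point (assumed here)
    for _ in range(steps):                # Step 3: repeat Step 2
        dJ_dm, dJ_db = gradient(m, b, xs, ys)
        m -= alpha * dJ_dm                # Step 2: move in the direction -grad J
        b -= alpha * dJ_db
    return m, b
```

For example, `gradient_descent([1, 2, 3], [2, 4, 6])` should return values close to the exact fit \(m = 2\), \(b = 0\).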
One can also consider more complicated models, such as quadratics (\(y = a x^2 + b x + c\)) or cubics (\(y = a x^3 + b x^2 + c x + d\)). Gradient descent still works in such cases, with suitable modifications to the cost function.
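For example, for the quadratic model the cost function is again half the average of the squared residuals, now regarded as a function of the three coefficients: \[ J(a, b, c) = \frac{1}{2n} \sum_{i=1}^n \left( a x_i^2 + b x_i + c - y_i \right)^2, \] and Step 2 updates the point \((a, b, c)\) using the corresponding three partial derivatives.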
The app below illustrates how gradient descent finds the linear/quadratic/cubic curve of best fit. Random data is selected, and you can initiate Step 2 of the gradient descent algorithm. The number \(\alpha\) is called the "learning rate" and can also be controlled.