1 Estimating the linear relationship between two variables

In the plot above, one has data on two characteristics (x and y) for 500 observations (the red points). The cloud of points clearly indicates that higher values of x correspond to higher values of y. Ordinary Least Squares (OLS) is a technique for turning that correspondence into the best-fitting linear equation \[\hat{y} =\hat{a}+\hat{b}x\tag{1}\] obtained by minimizing \(\sum_i{(y_{i}-\hat{y}_{i})^{2}}\), where i indexes the 500 observations. That is, one finds the values of \(\hat{a}\) (the y-intercept) and \(\hat{b}\) (the slope) that lead to the smallest sum of squared differences between the actual values \(y_i\) and the fitted values \(\hat{y}_i\) produced using \(\hat{a}\) and \(\hat{b}\). The estimator minimizes a sum of squares, hence the name least squares.

\[\min_{\hat{a},\hat{b}} \sum_i {(y_{i}-(\hat{a}+\hat{b}x_i))^{2}}\tag{2} \] The OLS estimator of the parameters of a linear equation has three features I would like to call to your attention:

  • The difference between y and its predicted value can be either positive (points above the green line) or negative (points below the green line). Squaring makes every difference non-negative, so positive and negative deviations cannot cancel and the minimization genuinely leads to the best fit.
  • Squaring the difference between y and its predicted value puts a proportionally larger penalty on big differences. For example, if the difference is 1, then the difference-squared is 1; but if the difference is 3, then the difference-squared is 9. This is a desirable feature, since we would typically be more concerned about a few big differences than about many small ones.
  • The solution to Equation (2) is relatively easy to calculate, making this a particularly attractive estimator before computers became cheap and powerful.
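To make the minimization in Equation (2) concrete, here is a minimal numerical sketch. It uses synthetic data in place of the 500 points plotted above (the "true" intercept and slope are assumptions of the sketch) and scipy's general-purpose minimizer rather than the closed-form solution derived below.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic stand-in for the 500 (x, y) points in the plot above.
# The "true" intercept (2.0) and slope (0.5) are assumptions of this sketch.
x = rng.uniform(0, 10, size=500)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=500)

def sum_of_squares(params):
    """The objective in Equation (2): sum of squared differences y_i - (a_hat + b_hat * x_i)."""
    a_hat, b_hat = params
    return np.sum((y - (a_hat + b_hat * x)) ** 2)

# Minimize S over (a_hat, b_hat); the result should be close to the assumed 2.0 and 0.5.
result = minimize(sum_of_squares, x0=[0.0, 0.0])
print(result.x)
```

The closed-form estimators derived in the next section give the same answer without any iterative search.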

2 Deriving the OLS estimators

Slightly rewrite Equation (2):

\[S =\sum_i {(y_{i}-\hat{a}-\hat{b}x_i)^{2}} \tag{3} \] In S, we treat the \(y_i\) and \(x_i\) as constants and allow \(\hat{a}\) and \(\hat{b}\) to vary, looking for the values at which S reaches its minimum. S is minimized where \({\delta S}/{\delta \hat{a}}=0\) and \({\delta S}/{\delta \hat{b}}=0\) (the first-order conditions).
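As a sketch of these two first-order conditions, the snippet below builds S for a tiny set of toy observations (assumed values, not the data from the plot) and lets sympy do the differentiation and solving.

```python
import sympy as sp

a_hat, b_hat = sp.symbols('a_hat b_hat')

# Toy data -- assumed values, used only to make S concrete.
x = [1, 2, 3, 4]
y = [2, 3, 5, 6]

# S from Equation (3): the sum of squared residuals as a function of a_hat and b_hat.
S = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, y))

# First-order conditions: dS/da_hat = 0 and dS/db_hat = 0.
solution = sp.solve([sp.diff(S, a_hat), sp.diff(S, b_hat)], [a_hat, b_hat])
print(solution)  # {a_hat: 1/2, b_hat: 7/5} for these toy values
```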

2.1 \(\frac{\delta S}{\delta \hat{b}}=0\)

\[\frac{\delta S}{\delta\hat{b}} = 2\sum_i {(y_{i}-\hat{a}-\hat{b}x_i)(-x_{i})}=0 \tag{4.1} \]

Divide both sides by -2, and carry out the multiplication within the summation sign:

\[\sum_i {(y_{i}x_i-\hat{a}x_i-\hat{b}x_i^{2})}=0 \tag{4.2} \] The expression can then be decomposed into three summations:

\[\sum_i {(y_{i}x_i)-\sum_i {\hat{a}x_i}-\sum_i {\hat{b}x_i^{2}}}=0 \tag{4.3} \] \(\hat{a}\) and \(\hat{b}\) are constants (they do not vary with i), allowing us to pull them out of the summation signs:

\[\sum_i {(y_{i}x_i)-\hat{a}\sum_i {x_i}-\hat{b}\sum_i {x_i^{2}}}=0 \tag{4.4} \] The mean of x is \(\bar{x} = \frac{\sum_{i}{x_{i}}}{N}\); thus, \(\sum_{i}{x_{i}}=N\bar{x}\). We can therefore simplify the above:

\[\sum_i {(y_{i}x_i)-\hat{a}N\bar{x}-\hat{b}\sum_i {x_i^{2}}}=0 \tag{4.5} \] Rearranging:

\[\hat{a}N\bar{x}=\sum_i {(y_{i}x_i)-\hat{b}\sum_i {x_i^{2}}} \tag{4.6} \]

And finally:

\[\hat{a}=\frac{\sum_i {(y_{i}x_i)-\hat{b}\sum_i {x_i^{2}}}}{N\bar{x}} \tag{4.7} \]
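As a quick check of Equation (4.7), the sketch below fits a line with np.polyfit (a reference implementation chosen here, not something used above) on synthetic data and confirms that the fitted intercept satisfies the formula.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=500)              # synthetic data, as in the earlier sketch
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=500)

b_hat, a_hat = np.polyfit(x, y, deg=1)        # OLS slope and intercept from a reference fit
N = len(x)

# Equation (4.7): a_hat = (sum(y*x) - b_hat * sum(x**2)) / (N * x_bar)
rhs = (np.sum(y * x) - b_hat * np.sum(x ** 2)) / (N * x.mean())
print(np.isclose(a_hat, rhs))                 # True, up to floating-point error
```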

2.2 \(\frac{\delta S}{\delta \hat{a}}=0\)

We now repeat the same steps, this time taking the derivative with respect to \(\hat{a}\); combining the result with Equation (4.7) will give us the optimal value of \(\hat{b}\).

\[\frac{\delta S}{\delta\hat{a}} = 2\sum_i {(y_{i}-\hat{a}-\hat{b}x_i)(-1)}=0 \tag{5.1} \]

\[-2\sum_i {(y_{i}-\hat{a}-\hat{b}x_i)}=0 \tag{5.2} \] Divide both sides by -2. The expression can then be decomposed into three summations, with the constants pulled out of each summation:

\[\sum_i {y_{i}-\hat{a}\sum_i{1}-\hat{b}\sum_i{x_i}}=0 \tag{5.3} \] Since \(\sum_{i}{y_i}=N\bar{y}\) and \(\sum_{i}{x_i}=N\bar{x}\) and \(\sum_{i}{1}=N\), we can simplify as follows:

\[N\bar{y}-\hat{a}N-\hat{b}N\bar{x}=0 \tag{5.4}\] Dividing both sides by N: \[\bar{y}-\hat{a}-\hat{b}\bar{x}=0 \tag{5.5}\]

Replacing \(\hat{a}\) with the formula from Equation (4.7):

\[\bar{y}-\frac{\sum_i {(y_{i}x_i)-\hat{b}\sum_i {x_i^{2}}}}{N\bar{x}}-\hat{b}\bar{x}=0\tag{5.6} \] Rearranging:
\[\frac{\sum_i{y_{i}x_{i}}}{N\bar{x}} + \hat{b} \left( \bar{x} - \frac{\sum_i{x_{i}^{2}}}{N\bar{x}} \right)=\bar{y} \tag{5.7} \]

Multiplying \(\bar{x}\) by \(\frac{N\bar{x}}{N\bar{x}}\):

\[\frac{\sum_i{y_{i}x_{i}}}{N\bar{x}} + \hat{b} \left( \frac{N\bar{x}^{2}-\sum_i{x_{i}^{2}}}{N\bar{x}} \right)=\bar{y} \tag{5.8} \] Multiply both sides by \(N\bar{x}\):

\[\sum_i{y_{i}x_{i}} + \hat{b} \left( N\bar{x}^{2}-\sum_i{x_{i}^{2}} \right)=N\bar{x}\bar{y} \tag{5.9} \]

\[\hat{b} \left( N\bar{x}^{2}-\sum_i{x_{i}^{2}} \right)=N\bar{x}\bar{y}-\sum_i{y_{i}x_{i}} \tag{5.10} \]

And finally,

\[\hat{b}=\frac{N\bar{x}\bar{y}-\sum_i{y_{i}x_{i}}}{N\bar{x}^{2}-\sum_i{x_{i}^{2}}} \tag{5.11} \]
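Equations (5.11) and (4.7) can be translated directly into code. The sketch below does so on synthetic data (an assumption, as in the earlier sketches) and compares the result against np.polyfit.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=500)              # synthetic data (assumed)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=500)

N = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Equation (5.11): the slope
b_hat = (N * x_bar * y_bar - np.sum(y * x)) / (N * x_bar ** 2 - np.sum(x ** 2))

# Equation (4.7): the intercept, given the slope
a_hat = (np.sum(y * x) - b_hat * np.sum(x ** 2)) / (N * x_bar)

slope_ref, intercept_ref = np.polyfit(x, y, deg=1)
print(np.allclose([a_hat, b_hat], [intercept_ref, slope_ref]))  # True
```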

3 Further things to note

3.1 OLS residuals sum to zero

Rewrite Equation (5.2) by dividing both sides by -2:

\[\sum_i {(y_{i}-\hat{a}-\hat{b}x_i)}=0 \tag{6.1} \] The expression inside the parentheses is simply the residual \(\mu_i = y_{i}-\hat{a}-\hat{b}x_i\), so the following must be true:

\[\sum_i {\mu_{i}}=0 \tag{6.2} \] Which means that the mean value of the residuals must always equal zero:
\[\bar\mu=\frac{1}{N} \sum_i {\mu_{i}}=0 \tag{6.3} \] Note that this result relies on the presence of an intercept term \(\hat{a}\) in the regression equation (Equation (5.2) is derived from \(\frac{\delta S}{\delta \hat{a}}=0\)). Including an intercept is therefore necessary to guarantee that the residuals sum to zero.
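A minimal numerical check of this result, again on synthetic data: with an intercept in the model the residuals sum to (numerically) zero, whereas a regression through the origin offers no such guarantee.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=500)              # synthetic data (assumed)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=500)

# Fit with an intercept: residuals sum to ~0 (floating-point noise only).
b_hat, a_hat = np.polyfit(x, y, deg=1)
residuals = y - (a_hat + b_hat * x)
print(residuals.sum())

# Fit without an intercept (b = sum(x*y) / sum(x**2)): no guarantee the residuals sum to zero.
b_no_intercept = np.sum(x * y) / np.sum(x ** 2)
print((y - b_no_intercept * x).sum())
```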

3.2 OLS residuals are orthogonal to the regressors

In other words, the independent variable x (the regressor) is uncorrelated with (orthogonal to) the residuals \(\mu\). We can show this using the formula for the covariance between x and \(\mu\):

\[Cov(x,\mu)= \frac{1}{N}\sum_i {(x_i-\bar{x})(\mu_i-\bar{\mu})} \tag{7.1} \] Since, from Equation (6.3), \(\bar\mu=0\), this can be simplified:

\[Cov(x,\mu)= \frac{1}{N}\sum_i {(x_i-\bar{x})\mu_i}= \frac{1}{N}\left(\sum_i {x_i\mu_i}-\bar{x}\sum_i {\mu_i}\right) \tag{7.2} \] From Equation (6.2), \(\sum_i {\mu_i}=0\); therefore: \[Cov(x,\mu)= \frac{1}{N}\sum_i {x_i\mu_i} \tag{7.3} \] But dividing Equation (4.1) by -2 gives: \[\sum_i {(y_{i}-\hat{a}-\hat{b}x_i)(x_{i})}=\sum_i {x_i\mu_i}= 0 \tag{7.4} \] Therefore, \(Cov(x,\mu)=0\), which means that the independent variable x is orthogonal to the residuals \(\mu\).
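And a corresponding check of orthogonality, under the same synthetic-data assumptions: both \(\sum_i x_i\mu_i\) and the sample covariance between x and the residuals are zero up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=500)              # synthetic data (assumed)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=500)

b_hat, a_hat = np.polyfit(x, y, deg=1)
residuals = y - (a_hat + b_hat * x)

print(np.sum(x * residuals))                  # Equation (7.4): ~0
print(np.cov(x, residuals)[0, 1])             # sample covariance: ~0
```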