In the plot above, one has data on two characteristics (x and y) for 500 observations (the red points). The cloud of points clearly indicates that higher values of x correspond to higher values of y. Ordinary Least Squares (OLS) is a technique for turning that correspondence into the best-fitting linear equation \[\hat{y} =\hat{a}+\hat{b}x\tag{1}\] by minimizing \(\sum_i{(y_{i}-\hat{y}_{i})^{2}}\), where each i indexes one of the 500 observations. That is, one finds the values of \(\hat{a}\) (the y-intercept) and \(\hat{b}\) (the slope) that lead to the smallest sum of squared differences between the actual values of y and the estimated values \(\hat{y}\) produced using \(\hat{a}\) and \(\hat{b}\). We are minimizing a sum of squares, hence the name least squares.
\[\min_{\hat{a},\hat{b}} \sum_i {(y_{i}-(\hat{a}+\hat{b}x_i))^{2}}\tag{2} \] The OLS estimator of the parameters of a linear equation has three features I would like to call to your attention.
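As a quick illustration of the minimization in Equation (2), the sketch below solves it numerically on hypothetical simulated data; the data-generating values, the seed, and the use of numpy/scipy are my own assumptions, not taken from the plot above.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: 500 observations where higher x tends to go with higher y.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=500)
y = 1.5 + 0.8 * x + rng.normal(scale=2.0, size=500)

def sum_of_squares(params):
    """The objective in Equation (2): sum over i of (y_i - (a + b*x_i))^2."""
    a_hat, b_hat = params
    return np.sum((y - (a_hat + b_hat * x)) ** 2)

result = minimize(sum_of_squares, x0=[0.0, 0.0])
print(result.x)  # [a_hat, b_hat], close to the 1.5 and 0.8 used to simulate y
```

The closed-form solution derived below gives the same answer without a numerical optimizer.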
To derive the estimator analytically, start by slightly rewriting Equation (2):
\[S =\sum_i {(y_{i}-\hat{a}-\hat{b}x_i)^{2}} \tag{3} \] In S, we take y and x as constants and allow \(\hat{a}\) and \(\hat{b}\) to vary in order to find the values at which S is at a minimum. We know that S is at a minimum where \({\delta S}/{\delta \hat{a}}=0\) and \({\delta S}/{\delta \hat{b}}=0\).
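For readers who want to double-check the calculus before working through it by hand, here is a small symbolic verification of the two derivatives on a single term of S (sympy is my choice of tool, not part of the original derivation).

```python
import sympy as sp

# One term of S from Equation (3); a and b stand in for a-hat and b-hat.
y_i, x_i, a, b = sp.symbols('y_i x_i a b')
S_i = (y_i - a - b * x_i) ** 2

# dS_i/db = 2*(y_i - a - b*x_i)*(-x_i); summing over i gives Equation (4.1).
print(sp.factor(sp.diff(S_i, b)))

# dS_i/da = 2*(y_i - a - b*x_i)*(-1); summing over i gives Equation (5.1).
print(sp.factor(sp.diff(S_i, a)))
```

Starting with the derivative with respect to \(\hat{b}\):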
\[\frac{\delta S}{\delta\hat{b}} = 2\sum_i {(y_{i}-\hat{a}-\hat{b}x_i)(-x_{i})}=0 \tag{4.1} \]
Divide both sides by -2, and carry out the multiplication within the summation sign:
\[\sum_i {(y_{i}x_i-\hat{a}x_i-\hat{b}x_i^{2})}=0 \tag{4.2} \] The expression can then be decomposed into three summations:
\[\sum_i {(y_{i}x_i)-\sum_i {\hat{a}x_i}-\sum_i {\hat{b}x_i^{2}}}=0 \tag{4.3} \] \(\hat{a}\) and \(\hat{b}\) are scalars, allowing us to pull them out of the summation signs:
\[\sum_i {(y_{i}x_i)-\hat{a}\sum_i {x_i}-\hat{b}\sum_i {x_i^{2}}}=0 \tag{4.4} \] The mean of x is \(\bar{x} = \frac{\sum_{i}{x_{i}}}{N}\); thus \(\sum_{i}{x_{i}}=N\bar{x}\). We can therefore simplify the above:
\[\sum_i {(y_{i}x_i)-\hat{a}N\bar{x}-\hat{b}\sum_i {x_i^{2}}}=0 \tag{4.5} \] Rearranging:
\[\hat{a}N\bar{x}=\sum_i {(y_{i}x_i)-\hat{b}\sum_i {x_i^{2}}} \tag{4.6} \]
And finally:
\[\hat{a}=\frac{\sum_i {(y_{i}x_i)-\hat{b}\sum_i {x_i^{2}}}}{N\bar{x}} \tag{4.7} \]
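Equation (4.7) can be checked numerically: fit the line with an off-the-shelf routine and confirm that the fitted intercept satisfies the formula, given the fitted slope. The data below are hypothetical, and np.polyfit is my choice of reference fit.

```python
import numpy as np

# Hypothetical data; x has a clearly nonzero mean so that N*x_bar is well away from zero.
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=500)
y = 1.5 + 0.8 * x + rng.normal(scale=2.0, size=500)

b_hat, a_hat = np.polyfit(x, y, 1)  # polyfit returns the slope first, then the intercept

# Equation (4.7): a_hat = (sum(y_i*x_i) - b_hat*sum(x_i^2)) / (N*x_bar)
N, x_bar = len(x), x.mean()
a_from_4_7 = (np.sum(y * x) - b_hat * np.sum(x ** 2)) / (N * x_bar)
print(np.isclose(a_hat, a_from_4_7))  # True
```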
We can now repeat the above steps with the other first-order condition, \({\delta S}/{\delta \hat{a}}=0\), which will let us solve for \(\hat{b}\).
\[\frac{\delta S}{\delta\hat{a}} = 2\sum_i {(y_{i}-\hat{a}-\hat{b}x_i)(-1)}=0 \tag{5.1} \]
\[-2\sum_i {(y_{i}-\hat{a}-\hat{b}x_i)}=0 \tag{5.2} \] Divide both sides by -2. The expression can then be decomposed into three summations, and the scalars then pulled out of each summation:
\[\sum_i {y_{i}-\hat{a}\sum_i{1}-\hat{b}\sum_i{x_i}}=0 \tag{5.3} \] Since \(\sum_{i}{y_i}=N\bar{y}\) and \(\sum_{i}{x_i}=N\bar{x}\) and \(\sum_{i}{1}=N\), we can simplify as follows:
\[N\bar{y}-\hat{a}N-\hat{b}N\bar{x}=0 \tag{5.4}\] Dividing both sides by N: \[\bar{y}-\hat{a}-\hat{b}\bar{x}=0 \tag{5.5}\]
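Before substituting for \(\hat{a}\), note what Equation (5.5) says: the fitted line passes through the point of means \((\bar{x}, \bar{y})\). A quick numerical confirmation on hypothetical data (np.polyfit again being my choice of fitting routine):

```python
import numpy as np

# Hypothetical data.
rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=500)
y = 2.0 - 0.5 * x + rng.normal(scale=1.0, size=500)

b_hat, a_hat = np.polyfit(x, y, 1)

# Equation (5.5): y_bar - a_hat - b_hat*x_bar = 0, i.e. (x_bar, y_bar) lies on the fitted line.
print(np.isclose(y.mean(), a_hat + b_hat * x.mean()))  # True
```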
Replacing \(\hat{a}\) with the formula from Equation (4.7):
\[\bar{y}-\frac{\sum_i {(y_{i}x_i)-\hat{b}\sum_i {x_i^{2}}}}{N\bar{x}}-\hat{b}\bar{x}=0\tag{5.6} \] Rearranging:
\[\frac{\sum_i{y_{i}x_{i}}}{N\bar{x}} + \hat{b} \left( \bar{x} - \frac{\sum_i{x_{i}^{2}}}{N\bar{x}} \right)=\bar{y} \tag{5.7} \]
Multiplying \(\bar{x}\) by \(\frac{N\bar{x}}{N\bar{x}}\):
\[\frac{\sum_i{y_{i}x_{i}}}{N\bar{x}} + \hat{b} \left( \frac{N\bar{x}^{2}-\sum_i{x_{i}^{2}}}{N\bar{x}} \right)=\bar{y} \tag{5.8} \] Multiply both sides by \(N\bar{x}\):
\[\sum_i{y_{i}x_{i}} + \hat{b} \left( N\bar{x}^{2}-\sum_i{x_{i}^{2}} \right)=N\bar{x}\bar{y} \tag{5.9} \]
Subtracting \(\sum_i{y_{i}x_{i}}\) from both sides: \[\hat{b} \left( N\bar{x}^{2}-\sum_i{x_{i}^{2}} \right)=N\bar{x}\bar{y}-\sum_i{y_{i}x_{i}} \tag{5.10} \]
And finally,
\[\hat{b}=\frac{N\bar{x}\bar{y}-\sum_i{y_{i}x_{i}}}{N\bar{x}^{2}-\sum_i{x_{i}^{2}}} \tag{5.11} \]
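Equations (5.11) and (4.7) together give a complete recipe for the two estimates. The sketch below applies them to hypothetical simulated data and compares the result against np.polyfit, used here only as an independent reference.

```python
import numpy as np

# Hypothetical data.
rng = np.random.default_rng(7)
x = rng.uniform(1, 10, size=500)
y = 1.5 + 0.8 * x + rng.normal(scale=2.0, size=500)

N, x_bar, y_bar = len(x), x.mean(), y.mean()

# Equation (5.11): the slope estimate.
b_hat = (N * x_bar * y_bar - np.sum(y * x)) / (N * x_bar ** 2 - np.sum(x ** 2))

# Equation (4.7): the intercept estimate, given the slope.
a_hat = (np.sum(y * x) - b_hat * np.sum(x ** 2)) / (N * x_bar)

print(a_hat, b_hat)
print(np.polyfit(x, y, 1))  # slope first, then intercept -- same values
```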
With the estimators in hand, we can now turn to the residuals. Rewrite Equation (5.2) by dividing both sides by -2:
\[\sum_i {(y_{i}-\hat{a}-\hat{b}x_i)}=0 \tag{6.1} \] The expression inside the parentheses is simply the residual \(\mu_i = y_i - \hat{y}_i\), so the following must be true:
\[\sum_i {\mu_{i}}=0 \tag{6.2} \] Which means that the mean value of the residuals must always equal zero:
\[\bar\mu=\frac{1}{N} \sum_i {\mu_{i}}=0 \tag{6.3} \] Note that this result relies on the existence of a y-intercept term \(\hat{a}\) in the regression model (Equation (5.2) is derived from \(\frac{\delta S}{\delta \hat{a}}=0\)). Including a y-intercept term is therefore necessary to guarantee that the residuals sum to zero.
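A numerical illustration of Equations (6.2) and (6.3), again on hypothetical data; the second fit deliberately omits the intercept to show that the guarantee disappears without it.

```python
import numpy as np

# Hypothetical data.
rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=500)
y = 1.5 + 0.8 * x + rng.normal(scale=2.0, size=500)

# Fit with an intercept: the residuals sum (and average) to zero up to floating-point error.
b_hat, a_hat = np.polyfit(x, y, 1)
resid = y - (a_hat + b_hat * x)
print(resid.sum())   # effectively zero, Equation (6.2)
print(resid.mean())  # effectively zero, Equation (6.3)

# Fit without an intercept (regression through the origin): no such guarantee.
b_no_intercept = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]
print((y - b_no_intercept * x).sum())  # generally well away from zero
```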
A related property is that the independent variable x (the regressor) is uncorrelated with (orthogonal to) the residuals \(\mu\). We can show this using the formula for the covariance between x and \(\mu\):
\[Cov(x,\mu)= \frac{1}{N}\sum_i {(x_i-\bar{x})(\mu_i-\bar{\mu})} \tag{7.1} \] Since, from Equation (6.3), \(\bar\mu=0\), this can be simplified:
\[Cov(x,\mu)= \frac{1}{N}\sum_i {(x_i-\bar{x})\mu_i}= \frac{1}{N}\left(\sum_i {x_i\mu_i}-\bar{x}\sum_i {\mu_i}\right) \tag{7.2} \] From Equation (6.2), \(\sum_i {\mu_i}=0\); therefore: \[Cov(x,\mu)= \frac{1}{N}\sum_i {x_i\mu_i} \tag{7.3} \] But from Equation (4.1), after dividing by \(-2\), we know: \[\sum_i {(y_{i}-\hat{a}-\hat{b}x_i)(x_{i})}=\sum_i {x_i\mu_i}= 0 \tag{7.4} \] Therefore \(Cov(x,\mu)=0\), which means that the independent variable x is orthogonal to the residuals \(\mu\).
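The orthogonality result can be confirmed with the same kind of hypothetical setup:

```python
import numpy as np

# Hypothetical data.
rng = np.random.default_rng(5)
x = rng.uniform(1, 10, size=500)
y = 1.5 + 0.8 * x + rng.normal(scale=2.0, size=500)

b_hat, a_hat = np.polyfit(x, y, 1)
resid = y - (a_hat + b_hat * x)

# Equation (7.4): the sum of x_i * mu_i is zero up to floating-point error ...
print(np.sum(x * resid))

# ... and therefore Cov(x, mu) is zero as well (Equations (7.1)-(7.3)).
print(np.mean((x - x.mean()) * resid))
```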