1 Try downloading this webpage

The functionality of this webpage is constrained in D2L, and you may find it easier to read and navigate if you download the HTML file to your own computer.

2 Resources to learn R

3 Possibly helpful resource regarding probability

Intuition builder for probability and statistics

4 The error term

We think of the error term as a random normal variable with a mean of zero. In Equation (1), the coefficients \(\alpha_0\) and \(\alpha_1\), as well as the independent variable \(x_i\), are not random; they are fixed values. But \(\epsilon_i\) is a random normal variate.

\[y_{i} = \alpha_{0} + \alpha_{1} x_{i} + \epsilon_{i} \tag{1}\]

The randomness of \(\epsilon_i\) has important consequences: when we estimate the coefficients \(\alpha_0\) and \(\alpha_1\), the uncertainty introduced into the equation by \(\epsilon_i\) passes to the estimated coefficients. Thus, these estimates are not fixed values; they are randomly distributed.

We can show this by simulation. The following script starts out with known values of \(\alpha_0\), \(\alpha_1\), and \(x_i\). We then create \(\epsilon_i\) by drawing values from the standard normal distribution (which has a mean of zero) and scaling them to have a standard deviation of 3. With these elements, we create values of \(y_i\). Now, with values for our dependent and independent variables, we can use the lm(y~x) function to estimate the values of \(\alpha_0\) and \(\alpha_1\). We keep these estimated coefficients, as well as their estimated standard errors.

The above process is repeated 5000 times, each time retaining the estimated coefficients and standard errors.

x<-runif(50)*20 # creating our x variable, this is randomly created, but done only once, creating 50 values between 0 and 20
cf<-se<-NULL # creating two empty objects to store results below
for (ix in 1:5000){
  y<-100-1*x+rnorm(50)*3 # the coefficients (intercept= 100, slope= -1) and x are the same in each iteration -- only the error changes -- the error is random normal, with a mean of zero and a standard deviation of 3
  q<-lm(y~x)
  se<-rbind(se,diag(vcov(q))^.5) # we collect in each iteration the estimated standard errors
  cf<-rbind(cf,coef(q)) # we collect in each iteration the estimated coefficients
}

The following gif shows what is happening at each iteration of the simulation: though the coefficients and the independent variable are fixed, the error term is random, which makes the estimates imprecise.
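
If you want to reproduce a single frame of that animation yourself, here is a minimal sketch (the object names y1 and q1 are my own, and it assumes the x created above is still in memory) that draws one new y, fits the regression, and plots the estimated line against the true line.

y1<-100-1*x+rnorm(50)*3            # one realization of y, as in a single loop iteration
q1<-lm(y1~x)                       # fit the regression for this realization
plot(x,y1,pch=16)                  # scatter of this iteration's data
abline(a=100,b=-1,col="red",lwd=2) # the true line (intercept 100, slope -1)
abline(q1,col="blue",lwd=2,lty=2)  # the estimated line for this iteration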

# The mean values of the coefficients should be close to the true values (intercept= 100, slope= -1) 
apply(cf,2,mean)
## (Intercept)           x 
##   100.00406    -1.00029
# The standard deviations of the coefficients can be taken as the true values of the standard errors
apply(cf,2,sd)
## (Intercept)           x 
##  0.79846587  0.07588065
# The mean values of the standard errors should be close to the true values of the standard errors
apply(se,2,mean)
## (Intercept)           x 
##  0.81559469  0.07584552

Histograms are a good way to visualize simulation results. We have 5000 estimates of the slope coefficient, and a histogram lets us see whether the distribution of those estimates is centered on the true value.

# A histogram of the estimated slope coefficients
hist(cf[,2],breaks=50)
abline(v=-1,col="red",lwd=3) # The solid red line shows the true value of the slope coefficient (-1)
abline(v=mean(cf[,2]),col="blue",lwd=1,lty=2) # The dashed blue line shows the mean value of the estimated slope coefficients

The takeaway here is that the estimated coefficient is just a guess: it could come from anywhere along the x-axis of the histogram, but it is much more likely to come from the middle of the distribution than from the tails.

5 The standard error

A standard error is simply the standard deviation of a statistic. Since the estimated coefficient is a statistic, the standard deviation of the estimated coefficient is called a standard error. It is a measure of the spread of the distribution shown in the histogram above. The formula for the estimated standard error, in a regression with just one independent variable, is:

\[\widehat{se}(\hat\alpha_1)=\sqrt{\frac{\hat\sigma^2}{\Sigma(x_{i}-\bar x)^2}}\quad\text{where}\quad\hat\sigma^2=\frac{\Sigma\hat\epsilon_{i}^2}{n-2} \tag{2}\]

There are just two pieces of information used to estimate the standard error for a regression with one independent variable: the variance of the residuals (\(\hat\sigma^2\)); and the variation in the independent variable (\(\Sigma(x_{i}-\bar x)^2\)).

We use the standard error, along with the estimated coefficient, when we create a t-statistic.
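
As a check on Equation (2), here is a minimal sketch (the objects x0, y0, and q0 are my own, not part of the original script) that computes the slope's standard error by hand from the residual variance and the variation in x, and compares it with the value lm reports.

x0<-runif(50)*20                          # an illustrative x, built the same way as in the simulation above
y0<-100-1*x0+rnorm(50)*3                  # one realization of y
q0<-lm(y0~x0)                             # fit the regression
sig2<-sum(resid(q0)^2)/(length(y0)-2)     # the variance of the residuals, with n-2 degrees of freedom
sebyhand<-sqrt(sig2/sum((x0-mean(x0))^2)) # Equation (2): estimated standard error of the slope
c(byhand=sebyhand,fromlm=summary(q0)$coefficients["x0","Std. Error"]) # the two numbers should match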

5.1 The variance-covariance matrix

A standard deviation is simply the square root of a variance. R produces the variances and covariances for the estimated parameters in a matrix, where the diagonal is the variance of each estimated coefficient, and the off-diagonal elements are the covariances between the row and column estimated coefficients. One can view the variance-covariance matrix by using the function vcov(q), where \(q\) is an lm object.

The easiest way to extract standard errors is to take the square root of the diagonal of the variance-covariance matrix: se<-diag(vcov(q))^.5.
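
Continuing with the model q0 from the sketch above (again, my own illustration rather than part of the original script), the pieces look like this:

vcov(q0)           # a 2x2 matrix: variances on the diagonal, covariances off the diagonal
diag(vcov(q0))^.5  # square roots of the diagonal: the standard errors of the intercept and slope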

6 Hypothesis Testing

6.1 Steps for conducting a hypothesis test:

  1. Set up a null hypothesis (i.e., posit that the true value is equal to a specific number).
  2. Create a test statistic that embodies that null hypothesis.
  3. Find the p-value (the probability of obtaining a test statistic at least as extreme as the one calculated, if the null hypothesis were true).
  4. If the p-value is low enough (typically \(\leq 0.05\)), then reject the null hypothesis.
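
As a quick illustration of these four steps outside the regression setting (this example, and the object names z and tt, are my own and not from the original notes), a one-sample t-test on a simulated mean runs through the same sequence:

z<-rnorm(30,mean=0.5) # 30 simulated observations whose true mean is 0.5
tt<-t.test(z,mu=0)    # steps 1 and 2: H0 is that the true mean equals 0; t.test builds the t-stat
tt$statistic          # the test statistic
tt$p.value            # step 3: the p-value
tt$p.value<=0.05      # step 4: TRUE means we reject H0 at the 0.05 level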

6.2 t-test for regression coefficients

  1. The null hypothesis automatically shown in your R output is \(H_0:\alpha_1=0\). This makes sense because we are interested in whether the independent variable has an effect on the dependent variable; our \(H_0\) is that it does not have an effect.
  2. We set up the t-statistic numerator as the estimated coefficient minus the hypothesized value; the denominator is the estimated standard error of the estimated coefficient. t-stat\(=\frac{\hat\alpha_1-0}{\widehat{se}(\hat\alpha_1)}\). This t-stat is a point on the t-distribution (for example, the blue marker in the figure below). Note that if \(\hat\alpha_1\) really does equal the hypothesized value, then the calculated t-stat would equal zero.
  3. The p-value reported in the lm summary is \(Pr(>|t|)\), the probability of drawing a value from the t-distribution at least as far from zero as the calculated t-stat. R handles this as a two-tailed test, as shown in the figure below, where the total shaded area is 0.10 of the area under the probability density function curve.
  4. If the p-value is \(\leq 0.10\) we reject \(H_0\) (because of multicollinearity, we generally use higher p-value cutoffs for regression coefficients than for other statistical tests). Any t-stat in the red shaded area of the figure falls in the 0.10 of the distribution furthest from zero. The location of our blue marker indicates that we should reject the null hypothesis.

T-distribution, with the tails shaded for a 0.10 size of test
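
A rough version of that figure can be drawn in base R; the sketch below is my own (with 46 degrees of freedom chosen to match the regression that follows) and shades the two 0.05 tails.

dft<-46                                               # degrees of freedom used for the picture
crit<-qt(0.95,dft)                                    # cutoff leaving 0.05 in the upper tail
curve(dt(x,dft),from=-4,to=4,xlab="t",ylab="density") # the t density
xs<-seq(crit,4,length.out=100)                        # grid over the upper tail
polygon(c(crit,xs,4),c(0,dt(xs,dft),0),col="red")     # shade the upper 0.05 tail
polygon(c(-crit,-xs,-4),c(0,dt(-xs,dft),0),col="red") # shade the lower 0.05 tail, by symmetry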

The R function pt will calculate the p-value for a t-statistic. However, one also needs to know the degrees of freedom (df). In a regression model, df = number of observations - number of estimated parameters. In the example below, we compare the results from the lm summary with the p-value calculated using pt.

u<-data.frame(readxl::read_xlsx("USstates.xlsx",sheet="USstates")) # read our USstates data
q<-lm(ginicoef~popsqmi,u) # regress inequality (Gini coefficient) on population density
se<-diag(vcov(q))^.5 # extract the diagonal from the variance-covariance matrix, then take its square root -- this gives the standard errors
cf<-coef(q) # extract the coefficient estimates
tst<-cf/se # the ratio of the estimated coefficient to the estimated standard error is the t-stat
dfr<-nrow(u)-length(cf) # the degrees of freedom equals the number of observations minus the number of estimated coefficients
data.frame(df=dfr,nobs=nrow(u),estparam=length(cf)) # display these values
##   df nobs estparam
## 1 46   48        2
pv<-round(pt(abs(tst),dfr,lower.tail=FALSE),4) # we calculate the area of a t-distribution with df=dfr, to the right of the absolute value of our t-stat; we then round our result to four decimal places. This is the p-value for a one-tail test
data.frame(cf,se,tst,pval=2*pv) # we display our values, multiplying the one-tail test p-value by 2 to get the two-tail p-value; this should exactly match the results below
##                       cf           se        tst   pval
## (Intercept) 4.610925e-01 3.337937e-03 138.136977 0.0000
## popsqmi     2.554528e-05 9.930118e-06   2.572506 0.0134
summary(q) # results should equal those above
## 
## Call:
## lm(formula = ginicoef ~ popsqmi, data = u)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.055712 -0.011006  0.000077  0.013867  0.051078 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.611e-01  3.338e-03 138.137   <2e-16 ***
## popsqmi     2.555e-05  9.930e-06   2.573   0.0134 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01839 on 46 degrees of freedom
## Multiple R-squared:  0.1258, Adjusted R-squared:  0.1068 
## F-statistic: 6.618 on 1 and 46 DF,  p-value: 0.01339

6.3 F-test for regression coefficients

The t-test tells us whether an individual regression coefficient is indistinguishable from zero. However, if we want to test whether two or more coefficients are jointly equal to zero, we need an F-test. We usually do this when trying to remove irrelevant variables from the model.

  1. Run a regression in which you include all the independent variables that you think—a priori—are relevant. This is the unrestricted regression.
  2. Store the sum of squared residuals (\(ESS_{UR}=\Sigma\hat\epsilon_{i}^2\)) and the degrees of freedom (\(df_{UR}\)) from this regression.
  3. Make a note of the variables which have a t-stat p-value above 0.10.
  4. Set up a new regression, which omits all those independent variables with a high p-value. This is the restricted regression.
  5. Store the sum of squared residuals (\(ESS_{R}=\Sigma\hat\epsilon_{i}^2\)) from this regression.
  6. Use the stored values to calculate the F-stat below; \(k\) is the number of restrictions (the number of variables dropped).

\[Fstat=\frac{(\frac{ESS_R-ESS_{UR}}{k})}{(\frac{ESS_{UR}}{df_{UR}})} \] This statistic has numerator degrees of freedom equal to \(k\), and denominator degrees of freedom equal to \(df_{UR}\).
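
As a worked example of this formula (the simulated data and every object name below are my own, purely for illustration), the following sketch builds an unrestricted model with two irrelevant variables, drops them in a restricted model, and computes the F-stat:

set.seed(2)                              # for reproducibility
n<-100
x1<-runif(n); x2<-runif(n); x3<-runif(n) # x2 and x3 will be irrelevant
y<-2+3*x1+rnorm(n)                       # the true model uses only x1
qur<-lm(y~x1+x2+x3)                      # the unrestricted regression: all candidate variables
essur<-sum(resid(qur)^2)                 # ESS_UR: sum of squared residuals, unrestricted
dfur<-qur$df.residual                    # df_UR: observations minus estimated parameters
qres<-lm(y~x1)                           # the restricted regression: x2 and x3 dropped
essr<-sum(resid(qres)^2)                 # ESS_R: sum of squared residuals, restricted
k<-2                                     # number of restrictions (two variables dropped)
fstat<-((essr-essur)/k)/(essur/dfur)     # the F-stat from the formula above
fstat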

6.3.1 Steps for conducting an F-test on restrictions

  1. \(H_0:\) the dropped variables do not belong in the model (the coefficients for the dropped variables are all equal to zero).
  2. We calculate the F-stat, as above.
  3. Using the F-stat, and the degrees of freedom, we calculate the p-value.
  4. If the F-stat p-value is \(\leq .05\) we reject \(H_0\).
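
Continuing with the objects from the sketch above, the p-value comes from pf; R's anova comparison of the two models should reproduce the same F-stat and p-value.

pf(fstat,k,dfur,lower.tail=FALSE) # area of the F-distribution to the right of our F-stat
anova(qres,qur)                   # R's built-in F-test on the dropped variables, for comparison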