# Simple Linear Regression

## 3.1 Introduction

In this chapter, we study extensively the estimation of a linear relationship between two variables, $Y_i$ and $X_i$, of the form:

$$Y_i = \alpha + \beta X_i + u_i, \qquad i = 1, 2, \ldots, n \tag{3.1}$$

where $Y_i$ denotes the $i$-th observation on the dependent variable $Y$, which could be consumption, investment or output, and $X_i$ denotes the $i$-th observation on the independent variable $X$, which could be disposable income, the interest rate or an input. These observations could be collected on firms or households at a given point in time, in which case we call the data a cross-section. Alternatively, these observations may be collected over time for a specific industry or country, in which case we call the data a time-series. $n$ is the number of observations, which could be the number of firms or households in a cross-section, or the number of years if the observations are collected annually. $\alpha$ and $\beta$ are the intercept and slope of this simple linear relationship between $Y$ and $X$. They are assumed to be unknown parameters to be estimated from the data.

A plot of the data, i.e., $Y$ versus $X$, is very illustrative, showing what type of relationship exists empirically between these two variables. For example, if $Y$ is consumption and $X$ is disposable income, then we would expect a positive relationship between these variables, and the data may look like Figure 3.1 when plotted for a random sample of households. If $\alpha$ and $\beta$ were known, one could draw the straight line $(\alpha + \beta X)$ as shown in Figure 3.1. It is clear that not all the observations $(X_i, Y_i)$ lie on the straight line $(\alpha + \beta X)$. In fact, equation (3.1) states that the difference between each $Y_i$ and the corresponding $(\alpha + \beta X_i)$ is due to a random error $u_i$. This error may be due to (i) the omission of relevant factors that could influence consumption, other than disposable income, like real wealth or varying tastes, or unforeseen events that induce households to consume more or less, (ii) measurement error, which could be the result of households not reporting their consumption or income accurately, or (iii) wrong choice of a linear relationship between consumption and income, when the true relationship may be nonlinear. These different causes of the error term will have different effects on the distribution of this error. In what follows, we consider only disturbances that satisfy some restrictive assumptions. In later chapters we relax these assumptions to account for more general kinds of error terms.
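To make equation (3.1) concrete, the following is a minimal simulation sketch of a consumption and income data set. The parameter values, the income figures, the helper name `simulate`, and the choice of normally distributed errors are all assumptions made for illustration; the text does not specify any of them. The sketch shows only that each observation departs from the line by exactly its error term.

```python
import random


def simulate(alpha, beta, x_values, sigma, seed=0):
    """Generate Y_i = alpha + beta * X_i + u_i for each X_i.

    The errors u_i are drawn as N(0, sigma^2). Normality is an
    assumption made here for illustration only.
    """
    rng = random.Random(seed)
    u = [rng.gauss(0.0, sigma) for _ in x_values]
    y = [alpha + beta * x + ui for x, ui in zip(x_values, u)]
    return y, u


# Hypothetical values: in practice alpha and beta are unknown.
ALPHA, BETA, SIGMA = 10.0, 0.8, 5.0
incomes = [20.0, 35.0, 50.0, 65.0, 80.0, 95.0]  # X_i: disposable income

consumption, errors = simulate(ALPHA, BETA, incomes, SIGMA)

# Each observation departs from the line (alpha + beta * X) by its error u_i.
for x, y, u in zip(incomes, consumption, errors):
    on_line = ALPHA + BETA * x
    print(f"X={x:5.1f}  Y={y:7.2f}  alpha+beta*X={on_line:6.2f}  u={u:6.2f}")
```

Because the simulation knows $\alpha$, $\beta$ and each $u_i$, it can print them directly; with real data only the pairs $(X_i, Y_i)$ are observed.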

In real life, $\alpha$ and $\beta$ are not known, and have to be estimated from the observed data $\{(X_i, Y_i)$ for $i = 1, 2, \ldots, n\}$. This also means that the true line $(\alpha + \beta X)$ as well as the true disturbances (the $u_i$'s) are unobservable. In this case, $\alpha$ and $\beta$ could be estimated by the best fitting line through the data. Different researchers may draw different lines through the same data. What makes one line better than another? One measure of misfit is the amount of error from the observed $Y_i$ to the guessed line; let us call the latter $\hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i$, where the hat (^) denotes a guess on the appropriate parameter or variable. Each observation $(X_i, Y_i)$ will have a corresponding observable error attached to it, which we will call $e_i = Y_i - \hat{Y}_i$; see Figure 3.2. In other words, we obtain the guessed $Y_i$, $(\hat{Y}_i)$, corresponding to each $X_i$ from the guessed line,

B. H. Baltagi, Econometrics, Springer Texts in Business and Economics, DOI 10.1007/978-3-642-20059-5_3, 49