In regression analysis we have two types of variables:

i) the dependent (or explained) variable, and ii) the independent (or explanatory) variable. As the names (explained and explanatory) suggest, the dependent variable is explained by the independent variable. In the simplest case of regression analysis there is one dependent variable and one independent variable. Let us assume that the consumption expenditure of a household is related to the household income. For example, it can be postulated that as household income increases, consumption expenditure also increases. Here, consumption expenditure is the dependent variable and household income is the independent variable.

Usually we denote the dependent variable as Y and the independent variable as X. Suppose we carry out a household survey and collect n pairs of observations on X and Y. The next step is to find out the nature of the relationship between X and Y. The relationship between X and Y can take many forms. The general practice is to express the relationship in terms of some mathematical equation. The simplest of these equations is the linear equation. This means that the relationship between X and Y is in the form of a straight line and is termed linear regression. When the equation represents a curve (not a straight line), the regression is called non-linear or curvilinear.

Now the question arises, ‘How do we identify the equation
form?’

There is no hard and fast rule as such. The form of the equation depends upon the reasoning and assumptions made by us. However, we may plot the X and Y variables on a graph paper to prepare a scatter diagram. From the scatter diagram, the location of the points on the graph paper helps in identifying the type of equation to be fitted. If the points lie more or less along a straight line, then a linear equation is assumed. On the other hand, if the points are not in a straight line but in the form of a curve, a suitable non-linear equation (one which resembles the scatter) is assumed. We have to take another decision as well, that is, the identification of the dependent and independent variables. This again depends on the logic put forth and the purpose of the analysis: whether ‘Y depends on X’ or ‘X depends on Y’.
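Alongside the visual check of the scatter diagram, the sample correlation coefficient gives a numerical indication of how closely the points lie to a straight line (values near +1 or −1 suggest a linear equation is reasonable). A minimal sketch in Python, with made-up income and expenditure figures for illustration:

```python
def pearson_r(x, y):
    """Sample correlation coefficient: a measure of linear association."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical data: household income (X) and consumption expenditure (Y)
income = [120, 150, 180, 210, 240, 270, 300]
expenditure = [110, 130, 152, 170, 195, 215, 236]

print(round(pearson_r(income, expenditure), 4))
```

For these illustrative figures the coefficient comes out very close to +1, which is consistent with the near-straight-line scatter such data would show.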

Thus there can be two regression equations from the same set of data. These are: i) Y is assumed to depend on X (this is termed the ‘Y on X’ line), and ii) X is assumed to depend on Y (this is termed the ‘X on Y’ line). You may by now be wondering about the term ‘regression’, which literally means ‘going back’. This name is associated with a phenomenon observed in a study on the relationship between the stature of fathers (X) and sons (Y). It was observed that the average stature of sons of the tallest fathers tends to be less than the average stature of these fathers. On the other hand, the average stature of sons of the shortest fathers tends to be more than the average stature of these fathers. This phenomenon was called regression towards the mean. Although this appeared somewhat strange at the time, it was later found to be due to natural variation within subgroups of a group, and the same phenomenon occurs in most problems and data sets.
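The two regression lines can be computed from the same summary quantities: the ‘Y on X’ slope is the ratio of the co-deviation sum to the sum of squared deviations of X, while the ‘X on Y’ slope uses the sum of squared deviations of Y. A minimal sketch with made-up data (the numbers are illustrative, not from any survey):

```python
def regression_lines(x, y):
    """Return (slope, intercept) pairs for the 'Y on X' line (Y = a + b*X)
    and the 'X on Y' line (X = a + b*Y), fitted by least squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    b_yx = sxy / sxx                      # slope of Y on X
    b_xy = sxy / syy                      # slope of X on Y
    return (b_yx, my - b_yx * mx), (b_xy, mx - b_xy * my)

x = [2, 4, 6, 8, 10]
y = [3, 7, 8, 12, 14]
y_on_x, x_on_y = regression_lines(x, y)
print("Y on X:", y_on_x)
print("X on Y:", x_on_y)
```

The two lines generally differ; they coincide only when every point lies exactly on one straight line.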

The explanation is that many tall men come from families of average stature due to the vagaries of natural variation, and they produce sons who, on the whole, are shorter than themselves. A similar phenomenon takes place at the lower end of the scale. Let us now discuss simple linear regression.
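Before turning to that, the regression-towards-the-mean effect can be illustrated with a small simulation. The model below (a son’s expected stature is pulled halfway back towards the population mean) and all the figures in it are illustrative assumptions, not data from the original study:

```python
import random

random.seed(42)

# Assumed population: stature roughly normal around 175 cm with s.d. 7 cm.
MEAN, SPREAD = 175.0, 7.0
fathers = [random.gauss(MEAN, SPREAD) for _ in range(2000)]
# Son's stature: pulled halfway back towards the mean, plus natural variation.
sons = [MEAN + 0.5 * (f - MEAN) + random.gauss(0, 6.0) for f in fathers]

# Take the tallest 10% of fathers and compare the group averages.
pairs = sorted(zip(fathers, sons), key=lambda p: p[0], reverse=True)
top = pairs[: len(pairs) // 10]
avg_f = sum(f for f, _ in top) / len(top)
avg_s = sum(s for _, s in top) / len(top)
print(f"tallest fathers: {avg_f:.1f} cm, their sons: {avg_s:.1f} cm")
```

The sons of the tallest fathers average noticeably less than their fathers, even though nothing in the model singles them out: it is simply natural variation within the subgroup.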