Types of Variables
Regression only works when the variable we are explaining (our
Y) is of a certain ilk. Below I outline the four types of variables
and their characteristics (the latter three are often called continuous
Regression and correlation are generally only appropriate for
interval and ratio variables. There are other tests that we will
get into later in the semester that are appropriate for nominal
and ordinal variables. Note that you can always turn a "higher
order" variable (i.e. interval/ratio) into a lower order
one - for instance, we can take income data and make it into a
poverty indicator that is ordinal (i.e. extremely poor, poor,
near poor, middle class and well-off). In general we do not want
to do this since we are eliminating information. However, we
may have strong substantive (theoretical) reasons for such a move.
We may believe that there are clear class cleavages in American
society, for instance, that are obscured when we look at income
as a ratio variable.
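
As a rough illustration of collapsing a ratio variable into an ordinal one, the sketch below bins incomes into the five categories named above. The dollar cutoffs and the function name `income_to_class` are hypothetical, chosen only for the example; they are not official poverty thresholds.

```python
# Hypothetical income cutoffs (upper bounds, in dollars); illustrative only.
CUTOFFS = [
    (10_000, "extremely poor"),
    (20_000, "poor"),
    (30_000, "near poor"),
    (100_000, "middle class"),
]


def income_to_class(income):
    """Collapse a ratio-level income into an ordinal category."""
    for upper_bound, label in CUTOFFS:
        if income < upper_bound:
            return label
    return "well-off"


incomes = [4_500, 18_000, 29_999, 30_000, 250_000]
classes = [income_to_class(income) for income in incomes]
print(classes)
```

Note what is lost: incomes of 29,999 and 30,000 differ by one dollar but land in different categories, while 30,000 and 99,999 become indistinguishable.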
Regression is a linear technique in which Y is related to X as:
$$Y = a + bX$$
b is the slope (or the slope estimate) - it is the average change
in Y associated with a one unit increase in X. b does not change
for a given regression (i.e. it is a constant).
a is the intercept; a is the value of Y when X equals zero. That
is, it is the value at which the line crosses the Y axis: the
distance from that intersection to the origin (the origin is the
zero point on both axes).
Both a and b are called regression coefficients.
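
A minimal sketch of how a and b can be computed for a line Y = a + bX, assuming ordinary least squares as the fitting method (the usual choice, though these notes do not name it). The data and the function name `fit_line` are made up for illustration.

```python
def fit_line(x, y):
    """Return (a, b): the least-squares intercept and slope of y on x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: co-variation of x and y divided by the variation in x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sxy / sxx
    # Intercept: the value of Y when X equals zero.
    a = y_mean - b * x_mean
    return a, b


# Made-up data, not from any real study.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(x, y)
print("intercept a =", round(a, 3))
print("slope b =", round(b, 3))
```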
If all the observations (cases / points) fall on a straight line,
then it is an exact relationship within what is called a deterministic
model. If they do not fall on a straight line then it is an inexact
relationship and we call it a probabilistic model. There are
several reasons why sociological data do not fall on a straight
R2 is also known as the "coefficient of determination":
the percentage of variance in our outcome measure that we have
explained through our equation (goodness of fit); thus 1 - R2
is the unexplained variation. R2 is equal to:
regression sum of squares / total sum of squares, or RSS / TSS
$$R^2 = \frac{\sum (\hat{y} - \bar{y})^2}{\sum (y - \bar{y})^2}$$
The left over variation is the error sum of squares and can be
written as:
$$ESS = \sum (y - \hat{y})^2$$
Thus, the total sum of squares can be broken down into the component
we predicted RSS and that which is left over. The logic is the
following: Starting with no predictor variables, if we had to
guess what someone's value on a particular measure was, assuming
that it was normally distributed, we would guess the mean. Given
this, we are trying to do better than our initial guess of the
mean, so our calculations of variation in the first equation above
are given with respect to the mean value.
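
The breakdown of the total sum of squares can be checked numerically. The sketch below uses made-up data and assumes an ordinary least squares fit; the names `tss`, `rss` and `ess` follow the abbreviations used in these notes (RSS = regression sum of squares, ESS = error sum of squares).

```python
def fit_line(x, y):
    """Return (a, b): the least-squares intercept and slope of y on x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    return y_mean - b * x_mean, b


# Made-up data, not from any real study.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(x, y)
y_mean = sum(y) / len(y)
y_hat = [a + b * xi for xi in x]

# Variation around our "initial guess" of the mean.
tss = sum((yi - y_mean) ** 2 for yi in y)
# The part of that variation the line predicts.
rss = sum((yh - y_mean) ** 2 for yh in y_hat)
# The part left over.
ess = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r2 = rss / tss
print("TSS =", round(tss, 4), " RSS + ESS =", round(rss + ess, 4))
print("R2 =", round(r2, 4), " unexplained =", round(1 - r2, 4))
```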
With a given R2, we don't know whether it is substantively
or statistically important. That is, we don't know whether we
should say "so what" to an R2 of .5 or one
of .1 or whatever. Whether it matters substantively is up to you.
The way we decide statistically is with a test that tells us
the probability of getting a result at least that strong from
randomness alone. This test is called an F-test (F comes from the
statistician Fisher, who discovered it and the distribution
associated with it).
It is simply the mean RSS over the mean ESS. This boils down
to the following equation:
F = regression mean square / mean square error
The mean square is not exactly just taking the RSS or the ESS
over the N; rather it is over the degrees of freedom (a concept
that we will go into later in the semester, but just think of
it as a magic wand right now). The degrees of freedom vary
depending on what operation we are talking about. For the case
of the F-test they are:
$$\text{Regression mean square} = \frac{\sum (\hat{y} - \bar{y})^2}{k}$$
$$\text{Mean square error} = \frac{\sum (y - \hat{y})^2}{n - (k+1)}$$
In these equations n is the sample size and k is the number of
predictors. It is worthwhile to know this general form for when
we use multiple predictor models, but for now, in our simple case
of a one predictor model, k is 1 so that the equations boil down
to:
$$\text{Regression mean square} = \sum (\hat{y} - \bar{y})^2$$
$$\text{Mean square error} = \frac{\sum (y - \hat{y})^2}{n - 2}$$
We divide the top line by the bottom and we get the F-statistic.
Then we go to the chart in our text book and look up the probability
of getting that F-statistic by chance given our degrees of freedom.
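So with 18 countries and 1 predictor, we look up F with 1 and 16
degrees of freedom (k = 1 for the regression mean square and
n - (k+1) = 18 - 2 = 16 for the mean square error).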
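
The F calculation for the one-predictor case can be sketched as follows. The data are made up, the fit is assumed to be ordinary least squares, and `f_statistic` is a hypothetical helper name. The code stops at the F value and its degrees of freedom, since the probability still has to be looked up in an F table (or, if SciPy happens to be installed, obtained from `scipy.stats.f.sf`).

```python
def f_statistic(x, y):
    """Return (F, df_regression, df_error) for a one-predictor regression."""
    n, k = len(x), 1
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    a = y_mean - b * x_mean
    y_hat = [a + b * xi for xi in x]

    rss = sum((yh - y_mean) ** 2 for yh in y_hat)          # regression SS
    ess = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # error SS

    df_regression = k
    df_error = n - (k + 1)
    f = (rss / df_regression) / (ess / df_error)
    return f, df_regression, df_error


# Made-up data, not from any real study.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
f, df1, df2 = f_statistic(x, y)
print("F =", round(f, 2), "on", df1, "and", df2, "degrees of freedom")
```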
Normally, sociologists use the cutoff of p < .05 to reject
the null hypothesis (that is, to call the regression equation
statistically significant). Sometimes, with small samples, we may
use a more lenient cutoff of p < .10.
In the simple case of one predictor, the overall statistical significance
of the model will tell us the significance of the predictor.
But in the more general case, each coefficient (b) will have its
own test of significance. We will get into that later as well.
But for now, it suffices to say that in the one variable model
the t-statistic (the equivalent of the F-statistic for the
particular coefficient, based on the t-distribution, which
we will also discuss in the future) is the square root of the
F-stat, give or take its sign. The probabilities are the same.
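
A short derivation shows why the squared t-statistic equals F when there is one predictor. It assumes that b is the ordinary least squares slope and that the standard error of b takes its usual textbook form; neither is spelled out in these notes, so treat this as a sketch under those assumptions. RSS and ESS are the regression and error sums of squares as abbreviated above.

```latex
\begin{align*}
\hat{y}_i - \bar{y} &= b\,(x_i - \bar{x})
  \quad\Rightarrow\quad
  \mathrm{RSS} = \sum_i (\hat{y}_i - \bar{y})^2
               = b^2 \sum_i (x_i - \bar{x})^2 \\
\mathrm{SE}(b) &= \sqrt{\frac{\mathrm{ESS}/(n-2)}{\sum_i (x_i - \bar{x})^2}} \\
t^2 = \left(\frac{b}{\mathrm{SE}(b)}\right)^2
  &= \frac{b^2 \sum_i (x_i - \bar{x})^2}{\mathrm{ESS}/(n-2)}
   = \frac{\mathrm{RSS}/1}{\mathrm{ESS}/(n-2)} = F
\end{align*}
```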
The issue of probabilities raises the two types of error that
we may commit in evaluating our hypothesis (as opposed to specification
error).
If H0 = null hypothesis = there is no true relationship between the variables
(any observed association is due to chance)
Type One Error (α): Falsely rejecting H0 when
it is actually the case. That is, if H0 is true, a result with
a .04 F-stat probability would still turn up by chance 4 out
of 100 times, and each time we would be committing this type of
error by rejecting H0. That's why a lower p score makes us more
confident. However, we can get a very strong F-stat (and low p
score) when we have a completely spurious relationship. So true
causation is a matter of interpretation or substantive /
theoretical significance.
Type Two Error (β): Falsely accepting H0
when it is not the case. That is, our F-stat did not see the
relationship we suspected, but there actually is one. This may
result from specification error.
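
A small simulation can make the Type One error rate concrete. The sketch below generates X and Y that are unrelated by construction (so H0 is true), fits a one-predictor regression by ordinary least squares, and counts how often F exceeds the .05 cutoff. The sample size of 18, the number of trials, the random seed and the helper name `f_stat` are all assumptions made for the example; 4.49 is the tabled .05 critical value for F with 1 and 16 degrees of freedom.

```python
import random


def f_stat(x, y):
    """F-statistic for a one-predictor least-squares regression."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    a = y_mean - b * x_mean
    rss = sum((a + b * xi - y_mean) ** 2 for xi in x)
    ess = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return rss / (ess / (n - 2))


random.seed(0)      # fixed seed so the run is repeatable
N_CASES = 18        # assumed sample size: n - 2 = 16 error df
CRITICAL_F = 4.49   # tabled .05 cutoff for F with 1 and 16 df
TRIALS = 5000

rejections = 0
for _ in range(TRIALS):
    x = [random.gauss(0, 1) for _ in range(N_CASES)]
    y = [random.gauss(0, 1) for _ in range(N_CASES)]  # unrelated to x
    if f_stat(x, y) > CRITICAL_F:
        rejections += 1  # H0 is true here, so this is a Type One error

type_one_rate = rejections / TRIALS
print("share of false rejections:", type_one_rate)
```

The share should land near .05, which is what the cutoff promises: the test is wrong about that often when there is truly nothing there.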
Regression Limitations and Diagnostics
Regression operates under a number of assumptions. However, it
is considered a robust technique. That is, you can violate
these assumptions to a limited extent and still get away with
it. They are the following:
In addition to the basic assumptions, we want to be wary of influential
cases that affect our results. Cases may be influential if they
are outliers on either the x or y axis. Certain outliers
may have more leverage on the overall equation than others,
however, depending on where they fall on the X axis. Sometimes
the s.d. of X is called the leverage of the X values on b because
increasing the s.d. of X increases the accuracy of b. For example,
if we were measuring the relationship between SAT scores and
college performance, we may have low leverage since only those
kids who had high SATs actually went to college and thus were
included in our study. But if one kid with low SATs got in, that
kid might be a special case and have a huge effect on our equation;
that is, one case far out will have lots of leverage on the line
and may be considered for removal.
The sensitivity of the
regression equation to outliers also depends on sample size: A
large N will be relatively unaffected by a few weird cases, but
a small one obviously will.
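
To see how leverage and sample size interact, the sketch below adds the same weird case to a small and a large sample and compares how far the slope moves. Everything here is constructed for illustration: Y equals X exactly over the range 1 to 10 (so the slope starts at 1), the added case sits far out at X = 20 with Y = 0, and the fit is assumed to be ordinary least squares.

```python
def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    return sxy / sxx


def slope_shift(n):
    """Change in slope after adding one far-out case to n tidy cases."""
    x = [1 + 9 * i / (n - 1) for i in range(n)]  # evenly spaced on 1..10
    y = list(x)                                  # y = x, so the slope is 1
    before = slope(x, y)
    after = slope(x + [20.0], y + [0.0])         # one high-leverage outlier
    return after - before


shift_small = slope_shift(10)
shift_large = slope_shift(1000)
print("slope change with N = 10:  ", round(shift_small, 3))
print("slope change with N = 1000:", round(shift_large, 3))
```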
DFITS: the difference between the fitted value for the ith
case with that case in and with that case dropped, standardized
into z-scores (remember: the area under the curve). Any case whose
absolute DFITS is over 2 times the square root of (p+1)/n, where
p = number of variables (k), is influential and is a candidate
for being removed. However, the best way to detect naughty cases
is to plot these values in a normal quantile plot and pick out
the influential cases visually.
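
A sketch of the DFITS calculation for a one-predictor regression, using the standard closed-form shortcut (leverage plus the residual variance with the case dropped) rather than refitting the line n times. The SAT and GPA figures are invented to echo the example above, with one low-SAT student who did well; `dfits` is a hypothetical helper name, and the fit is assumed to be ordinary least squares.

```python
import math


def dfits(x, y):
    """DFITS for each case in a one-predictor least-squares regression."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    a = y_mean - b * x_mean
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    ess = sum(e * e for e in resid)

    values = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - x_mean) ** 2 / sxx  # leverage of this case
        # Error variance with this case dropped (n - 3 df remain).
        s2_dropped = (ess - e * e / (1 - h)) / (n - 3)
        t = e / math.sqrt(s2_dropped * (1 - h))  # studentized residual
        values.append(t * math.sqrt(h / (1 - h)))
    return values


# Invented data: the last student has a low SAT score but a high GPA.
sat = [1200, 1250, 1300, 1350, 1400, 1450, 1500, 1550, 700]
gpa = [3.0, 3.1, 3.3, 3.2, 3.5, 3.6, 3.6, 3.8, 3.9]

scores = dfits(sat, gpa)
cutoff = 2 * math.sqrt((1 + 1) / len(sat))  # 2 * sqrt((p + 1) / n), p = 1
flagged = [i for i, d in enumerate(scores) if abs(d) > cutoff]
print("cutoff:", round(cutoff, 3))
print("flagged cases (0-based):", flagged)
```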
Studentized or standardized residuals: Residuals have a
mean of zero; therefore if we divide them by their standard deviation
we have standardized them. Since the standard deviation (like
the line itself) is affected by the extreme cases, we divide each
residual by the standard deviation calculated without that case
included. Again, a quantile plot is the best way to pick out the
bad apples. But as a rule of thumb, an absolute value > 2 is a
rotting apple and > 3 is rotted to the core.
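
The "standard deviation without that case included" can be taken literally by refitting the line with each case left out in turn. The sketch below does that for made-up data with one case sitting far off the line. It also applies the usual leverage adjustment, sqrt(1 - h), which is part of the standard definition of a studentized residual but is an addition to the description above. The helper names are hypothetical and the fit is assumed to be ordinary least squares.

```python
import math


def fit_line(x, y):
    """Return (a, b): the least-squares intercept and slope of y on x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    return y_mean - b * x_mean, b


def residual_sd(x, y):
    """Standard deviation of the residuals around the fitted line."""
    a, b = fit_line(x, y)
    ess = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(ess / (len(x) - 2))


def studentized_residuals(x, y):
    n = len(x)
    a, b = fit_line(x, y)
    x_mean = sum(x) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    result = []
    for i in range(n):
        e = y[i] - (a + b * x[i])               # residual from the full fit
        h = 1 / n + (x[i] - x_mean) ** 2 / sxx  # leverage of case i
        # Refit without case i to get an s.d. it cannot inflate.
        s_dropped = residual_sd(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        result.append(e / (s_dropped * math.sqrt(1 - h)))
    return result


# Made-up data: the sixth case sits well above the line.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 18.0, 14.1, 15.9, 18.2, 19.9]

t_values = studentized_residuals(x, y)
rotten = [i for i, t in enumerate(t_values) if abs(t) > 3]
print("cases beyond 3 (0-based):", rotten)
```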
I prefer DFITS since it tells more - it tells leverage not just outlierness; see figures on p. 165.