Lecture Three

Types of Variables

Regression only works when the variable we are explaining (our Y) is of a certain ilk. Below I outline the four types of variables and their characteristics (the latter two, interval and ratio, are often called continuous variables).

Nominal: indicates a difference, without any implied ordering.
Example: Religion: 1=Catholic; 2=Protestant; 3=Jewish; 4=Muslim; 5=other

Ordinal: indicates a difference, and indicates the direction of the difference (e.g., more or less than).
Example: Attitude on a subject: 1=strongly disagree; 2=disagree; 3=don't care / don't know; 4=agree; 5=strongly agree

Interval: indicates a difference, with directionality and the amount of difference in equal intervals.
Examples: Temperature in Celsius; Occupational Prestige

Ratio: indicates a difference, indicates the direction of the difference, indicates the amount of the difference in equal intervals, and indicates an absolute zero.
Examples: Temperature in Kelvin; Years of Schooling

Regression and correlation are generally only appropriate for interval and ratio variables. There are other tests that we will get into later in the semester that are appropriate for nominal and ordinal variables. Note that you can always turn a "higher order" variable (i.e. interval/ratio) into a lower order one - for instance, we can take income data and make it into a poverty indicator that is ordinal (i.e. extremely poor, poor, near poor, middle class and well-off). In general we do not want to do this since we are eliminating information. However, we may have strong substantive (theoretical) reasons for such a move. We may believe that there are clear class cleavages in American society, for instance, that are obscured when we look at income as a ratio variable.
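To make the recoding concrete, here is a minimal sketch of turning a ratio-level income into the five-category ordinal poverty indicator. The dollar cut points are hypothetical, chosen only for illustration; they are not official poverty thresholds.

```python
from bisect import bisect_right

# Hypothetical income cut points (in dollars); for illustration only.
CUTOFFS = [10_000, 20_000, 30_000, 100_000]
LABELS = ["extremely poor", "poor", "near poor", "middle class", "well-off"]

def poverty_category(income):
    """Map a ratio-level income to an ordinal category (rank 1-5 plus label)."""
    rank = bisect_right(CUTOFFS, income)  # how many cut points are <= income
    return rank + 1, LABELS[rank]

incomes = [4_500, 18_000, 29_999, 30_000, 250_000]
categories = [poverty_category(x) for x in incomes]
for income, (rank, label) in zip(incomes, categories):
    print(income, rank, label)
```

Notice what is lost: $30,000 and $99,999 land in the same category, and the ranks no longer tell us how far apart two families are.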

Regression Summary:

Regression is a linear technique in which Y is related to X as: y=a+bX

b is the slope (or the slope estimate) - it is the average change in y associated with a one unit increase in X. b does not change for a given regression (i.e. it is a constant)

a is the intercept; a is the value of Y when X equals zero. That is, it is the value at which the line crosses the Y axis: the distance from that intersection to the origin (the origin is the zero point on both axes).

Both a and b are called regression coefficients.
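As a rough illustration of where a and b come from, the sketch below computes them with the usual least-squares formulas (an assumption here, since these notes do not spell out the estimator). The schooling and income figures are made up.

```python
def fit_line(x, y):
    """Return (a, b): least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: co-variation of x and y divided by the variation in x
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sxy / sxx
    # Intercept: forces the line through the point (x_mean, y_mean)
    a = y_mean - b * x_mean
    return a, b

# Made-up data: years of schooling (X) and income in $1,000s (Y)
schooling = [8, 10, 12, 12, 14, 16, 16, 18]
income = [18, 24, 30, 27, 38, 45, 41, 52]

a, b = fit_line(schooling, income)
print("intercept a =", round(a, 2), " slope b =", round(b, 2))
```

Here b is read as the average change in income (in $1,000s) associated with one more year of schooling, and a is the predicted income at zero years of schooling.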

If all the observations (cases / points) fall on a straight line, then it is an exact relationship within what is called a deterministic model. If they do not fall on a straight line then it is an inexact relationship and we call it a probabilistic model. There are several reasons why sociological data do not fall on a straight line:

  1. Measurement Error: we cannot measure Y and/or X exactly because of the imprecision of our instruments and methods (accounting for this fact is where statistics originated, in astronomy, with what is known as the error curve; see Porter 1986).

  1. Inherent Variability: Even if we were to measure things perfectly, social systems do not act as consistently as gravity or other natural phenomena that may follow a deterministic model. This may partly result from the fact that we have social actors (cases) who are aware of the categories of measurement and change their behavior accordingly (e.g. teachers teach to the test when we rely on standardized exams). It could result from the fact that sociology has a particularly acute case of the observer effect: observation affects the outcome (Schroedinger's box). Lastly, it may be something fundamental about human will and free agency that there is a bit of randomness… this is best left to the philosophers and theologians.

  1. Specification Error: Since most dependent (response) variables are caused by more than one variable, it may be the case that we are constructing the wrong equation (e.g., we need to measure education to predict income, not parental income).

  1. Non-linearity: Finally, it may be the case that we have specified the correct variables, but their relationship is not a linear one. For example, the effect of income on years of schooling may be strong but not linear. We should not expect an additional twenty thousand dollars of family income to make as much of a difference for a family in the two hundred thousand dollar range as it would for one in the $20,000 ballpark. We can try to make our variables linear (through log transformations, etc.) or we can specify a non-linear (quadratic or higher order) equation by including higher power terms, such that the equation looks like: $y = a + b_1 x + b_2 x^2$. We will address how to do this later in the semester when we take up multiple regression (regression with more than one predictor).
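The non-linearity point can be sketched with a log transformation. The numbers below are made up so that each doubling of family income goes with roughly the same gain in schooling. The sketch uses the fact that, with a single predictor, the share of variance explained equals the squared correlation between X and Y.

```python
import math

def r_squared(x, y):
    """Share of variance in y explained by a straight line in x
    (for one predictor this equals the squared correlation)."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

# Made-up data: family income in $1,000s and years of schooling
family_income = [10, 20, 40, 80, 160, 320]
schooling = [8.2, 9.4, 11.1, 12.4, 14.1, 15.4]

r2_raw = r_squared(family_income, schooling)
r2_log = r_squared([math.log(v) for v in family_income], schooling)
print("straight line in income:     ", round(r2_raw, 3))
print("straight line in log(income):", round(r2_log, 3))
```

With these invented numbers the logged predictor fits much better, which is the pattern we would expect when extra dollars matter less at the top of the income range.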


$R^2$ is also known as the "coefficient of determination": the proportion of variance in our outcome measure that we have explained through our equation (goodness of fit); thus $1 - R^2$ is the unexplained variation. $R^2$ is equal to:

regression sum of squares / total sum of squares or RSS / TSS

$R^2 = \sum(\hat{y} - \bar{y})^2 / \sum(y - \bar{y})^2$

The leftover variation is the error sum of squares and can be represented by:

$ESS = \sum(y - \hat{y})^2$

Thus, the total sum of squares can be broken down into the component we predicted (RSS) and that which is left over (ESS): TSS = RSS + ESS. The logic is the following: Starting with no predictor variables, if we had to guess what someone's value on a particular measure was, assuming that it was normally distributed, we would guess the mean. Given this, we are trying to do better than our initial guess of the mean, so our calculations of variation in the first equation above are given with respect to the mean value.
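The decomposition can be checked numerically. This is a small sketch on made-up schooling and income figures, assuming a least-squares line. It keeps the naming used in these notes: RSS is the regression sum of squares and ESS is the error sum of squares.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_mean - b * x_mean, b

# Made-up data: years of schooling (X) and income in $1,000s (Y)
x = [8, 10, 12, 12, 14, 16, 16, 18]
y = [18, 24, 30, 27, 38, 45, 41, 52]

a, b = fit_line(x, y)
y_mean = sum(y) / len(y)
y_hat = [a + b * xi for xi in x]          # predicted values

tss = sum((yi - y_mean) ** 2 for yi in y)               # total variation around the mean
rss = sum((yh - y_mean) ** 2 for yh in y_hat)           # variation we predicted
ess = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # variation left over

r_squared = rss / tss
print("TSS =", round(tss, 2), " RSS =", round(rss, 2), " ESS =", round(ess, 2))
print("R squared =", round(r_squared, 3))
```

Guessing the mean for everyone would leave all of TSS unexplained; the line improves on that guess by RSS.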


With a given $R^2$, we don't know whether it is substantively or statistically important. That is, we don't know whether we should say "so what" to an $R^2$ of .5 or one of .1 or whatever. How we decide substantively is up to you. The way we decide statistically is with a test that tells us what the probability of that result coming from randomness would be. This test is called an F-test (F comes from the statistician Fisher, who discovered it and the distribution associated with it).

It is simply the mean RSS over the mean ESS. This boils down to the following equation:

F = regression mean square / mean square error

The mean square is not exactly the RSS or the ESS over the n; rather it is over the degrees of freedom (a concept that we will go into later in the semester, but just think of it as a magic wand right now). The degrees of freedom vary depending on what operation we are talking about. For the case of the F-test they are:

Regression mean square = $\sum(\hat{y} - \bar{y})^2 / k$

Mean square error = $\sum(y - \hat{y})^2 / [n - (k+1)]$

In these equations n is the sample size and k is the number of predictors. It is worthwhile to know this general form for when we use multiple predictor models, but for now, in our simple case of a one predictor model, k is 1 so that the equations boil down to:

Regression mean square = $\sum(\hat{y} - \bar{y})^2$

Mean square error = $\sum(y - \hat{y})^2 / (n - 2)$

We divide the top line by the bottom and we get the F-statistic. Then we go to the chart in our textbook and look up the probability of getting that F-statistic by chance given our degrees of freedom. So with 18 countries and 1 predictor, we look up $F_{1,16}$ (k = 1 and n - (k+1) = 16 degrees of freedom). Normally, sociologists use the cut off of p<.05 to reject the null hypothesis (that is, to call the regression equation statistically significant). Sometimes, with small samples, we may use a more lenient cut off of p<.10.
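Here is a sketch of the F calculation for the one-predictor case, using 18 invented observations to mirror the 18-country example. The data, the least-squares fit and the function name are assumptions for illustration; the probability itself would still be looked up in the F table.

```python
def f_statistic(x, y):
    """F-statistic and degrees of freedom for a one-predictor regression."""
    n, k = len(x), 1
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_mean - b * x_mean
    y_hat = [a + b * xi for xi in x]
    rss = sum((yh - y_mean) ** 2 for yh in y_hat)          # regression sum of squares
    ess = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # error sum of squares
    df_regression = k
    df_error = n - (k + 1)
    f = (rss / df_regression) / (ess / df_error)
    return f, df_regression, df_error

# Made-up data for 18 "countries": a steady trend plus alternating noise
x = list(range(1, 19))
y = [2 + 0.5 * xi + (1 if xi % 2 else -1) for xi in x]

f, df1, df2 = f_statistic(x, y)
print("F =", round(f, 1), "with", df1, "and", df2, "degrees of freedom")
```

The tabled 5% critical value for 1 and 16 degrees of freedom is about 4.49, so an F this large would lead us to reject the null hypothesis.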

In the simple case of one predictor, the overall statistical significance of the model will tell us the significance of the predictor. But in the more general case, each coefficient (b) will have its own test of significance. We will get into that later as well. But for now, it suffices to say that in the one-variable model you may notice that the t-statistic (the equivalent of the F-statistic for the particular coefficient, based on the t-distribution, which we will also discuss in the future) is, in absolute value, the square root of the F-statistic. The probabilities are the same…
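The link between t and F can be verified directly. The sketch below assumes the standard formula for the standard error of the slope (the square root of the mean square error over the sum of squared deviations of X), which these notes have not yet introduced. The data are made up.

```python
import math

# Made-up data: years of schooling (X) and income in $1,000s (Y)
x = [8, 10, 12, 12, 14, 16, 16, 18]
y = [18, 24, 30, 27, 38, 45, 41, 52]
n = len(x)

x_mean, y_mean = sum(x) / n, sum(y) / n
sxx = sum((xi - x_mean) ** 2 for xi in x)
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_mean - b * x_mean
y_hat = [a + b * xi for xi in x]

rss = sum((yh - y_mean) ** 2 for yh in y_hat)
ess = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
mean_square_error = ess / (n - 2)

f_stat = rss / mean_square_error              # k = 1, so regression mean square = RSS
se_b = math.sqrt(mean_square_error / sxx)     # assumed standard error of the slope
t_stat = b / se_b

print("F =", round(f_stat, 3), " t =", round(t_stat, 3), " t squared =", round(t_stat ** 2, 3))
```

Because RSS equals b squared times the sum of squared deviations of X, squaring t reproduces F exactly in the one-predictor case.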

The issue of probabilities raises the two types of error that we may commit in evaluating our hypothesis (as opposed to specification error):

Let H0 = null hypothesis = there is no true relationship between the variables (any observed association is due to chance).

Type One Error (α): Falsely rejecting H0 when it is actually true. That is, if H0 is true, a result with an F-statistic probability of .04 would still turn up by chance 4 out of 100 times, so rejecting H0 at that level commits this type of error 4 times in 100. That's why a lower p-value makes us more confident. However, we can get a very strong F-statistic (and low p-value) when we have a completely spurious relationship. So true causation is a matter of interpretation or substantive / theoretical significance.

Type Two Error (β = 1 - power): Falsely accepting H0 when it is not true. That is, our F-statistic did not detect the relationship we suspected, but there actually is one. This may result from specification error.
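A small simulation can illustrate what a Type One error rate means. In this sketch (all settings are assumptions: 18 cases, normally distributed Y, 2,000 trials, a fixed random seed) Y is generated with no relationship to X at all, and we count how often the F-test rejects H0 anyway at the .05 level. The critical value 4.49 is the tabled 5% value for 1 and 16 degrees of freedom.

```python
import random

def f_stat(x, y):
    """F-statistic for a one-predictor least-squares regression."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    tss = sum((yi - y_mean) ** 2 for yi in y)
    rss = sxy ** 2 / sxx      # regression sum of squares
    ess = tss - rss           # error sum of squares
    return rss / (ess / (n - 2))

F_CRITICAL = 4.49             # tabled 5% cut off for F with 1 and 16 df
random.seed(3)                # fixed seed so the run is repeatable
n, trials = 18, 2000
x = list(range(n))

false_rejections = 0
for _ in range(trials):
    y = [random.gauss(0, 1) for _ in x]   # H0 is true: y is unrelated to x
    if f_stat(x, y) > F_CRITICAL:
        false_rejections += 1

type_one_rate = false_rejections / trials
print("share of true nulls rejected:", type_one_rate)
```

The share should come out close to .05: even when nothing is going on, roughly one regression in twenty clears the conventional bar.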

Regression Limitations and Diagnostics

Regression operates under a number of assumptions. However, it is considered a robust technique. That is, you can violate these assumptions to a limited extent and still get away with it. They are the following:

  1. No measurement error: we always violate this, but the best time to think about this is when you are designing your study / survey. Is counting the number of punitive or restitutive laws (as Durkheim did in The Division of Labor) a good method for evaluating the type of social solidarity in a given society? Is asking the number of children in a household really measuring family size? We can test for the validity and reliability of our measures through pilot studies and other means such as instrumental variable approaches that will be the subject of next week's Thursday lecture.

  1. The independent variables are uncorrelated with the error term: What if we are predicting income by number of children and we notice that the error from our regression plot (i.e. the residuals) follows a bow shape (see Figure 2.19b in the text, page 157)? Then our error is associated with our predictor. A bow shape indicates that a curvilinear form would be more appropriate, perhaps $y = a + bX^2$. What if the error (residual) increases with the number of children we observe? That is, we are good at predicting family income differences between one- and two-kid families and bad at predicting them between seven- and eight-kid families? Maybe this results from varied interpretations of what constitutes income among big and small families (i.e. measurement error); maybe it results from a "lurking variable." If the residuals follow a fan / cone shape then we have heteroscedasticity (see Figure 2.19c in the textbook). This often occurs when the unit of analysis in our data is aggregated (as it is in the LIS data). For the big (and rich) countries (e.g. the USA) with big (and more accurate) samples, we can measure family size and income more accurately, but with the small countries we cannot. This would result in systematic variance in the error terms. Even with non-aggregated data, it may be that there are more small families to measure and thus we can get more accurate estimates for them than for the large ones, so the residuals are greater at the large-family end of the spectrum. To test for this we plot the residuals against the Y value and then against the X value to look for patterns. How do we correct it? There may be some other systematic effect going on, like a lurking variable affecting our results (such as country size). Often a transformation (e.g. logging) will do the trick, or dividing our data up by what we think may be the lurking variable, but there are other ways that we will address later in the semester, such as multiple regression and interaction terms.

  1. Lack of Multicollinearity: This only applies when we have more than one predictor. It is the assumption that the predictors are not too highly correlated with one another. We do not need to address this possibility until we get to multiple regression.
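The fan-shaped residual pattern described under the second assumption can be sketched numerically. This is a crude stand-in for actually plotting the residuals, not a formal test. The data are invented so that income is harder to predict for larger families: the spread around the line grows with the number of children.

```python
# Made-up data: two families at each size from 1 to 8 children; income (in $1,000s)
# falls by 3 per child, and the scatter around that trend grows with family size.
children = [c for c in range(1, 9) for _ in range(2)]
income = [60 - 3 * c + sign * 1.5 * c for c in range(1, 9) for sign in (1, -1)]

n = len(children)
x_mean, y_mean = sum(children) / n, sum(income) / n
sxx = sum((x - x_mean) ** 2 for x in children)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(children, income))
b = sxy / sxx
a = y_mean - b * x_mean

# Residual = observed income minus the income predicted by the line
residuals = [y - (a + b * x) for x, y in zip(children, income)]

# Compare the typical residual size for small and large families
small = [abs(e) for x, e in zip(children, residuals) if x <= 4]
large = [abs(e) for x, e in zip(children, residuals) if x > 4]
spread_small = sum(small) / len(small)
spread_large = sum(large) / len(large)

print("mean |residual|, 1-4 children:", round(spread_small, 2))
print("mean |residual|, 5-8 children:", round(spread_large, 2))
```

A much larger typical residual at one end of the X axis is the numerical counterpart of the fan or cone shape; in practice the residual plot itself remains the better diagnostic.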

In addition to the basic assumptions, we want to be wary of influential cases that affect our results. Cases may be influential if they are outliers on either the X or the Y axis. Certain outliers may have more leverage on the overall equation than others, however, depending on where they fall on the X axis. Sometimes the s.d. of X is called the leverage of the X values on b, because a larger s.d. of X increases the precision of b. For example, if we were measuring the relationship between SAT scores and college performance, we may have low leverage since only those kids who had high SATs actually went to college and thus were included in our study. But if one kid with low SATs got in, he might be a special case and have a huge effect on our equation; that is, one case far out will have lots of leverage on the line and may be considered for removal. The sensitivity of the regression equation to outliers also depends on sample size: a large N will be relatively unaffected by a few weird cases, but a small one obviously will.

DFITS (difference between fitted values): the difference between the fitted value with the ith case in and with the ith case dropped, then standardized into z-scores (remember: the area under the curve). Any case with an absolute value over $2\sqrt{(p+1)/n}$, where p = the number of variables (k), is influential and is a candidate for being removed. However, the best way to detect naughty cases is to plot these values in a normal quantile plot to visually detect influential cases.

Studentized or standardized residuals: Residuals have a mean of zero; therefore if we divide them by their standard deviations we have standardized them. Since the standard deviations (like the line itself) are affected by the extreme cases, we divide by the standard deviation without that case included. Again, a quantile plot is the best way to pick out the bad apples. But as a rule of thumb, an absolute value > 2 is a rotting apple and > 3 is rotted to the core.

I prefer DFITS since it tells more - it tells leverage not just outlierness; see figures on p. 165.
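Both diagnostics can be sketched for the SAT example. The scores and GPAs below are made up: ten students with high SATs plus one low-SAT student who did unusually well. The formulas are the standard one-predictor versions (leverage from the distance of X from its mean, residuals studentized with the case left out), which is an assumption about what the textbook computes. The cutoff follows the rule of thumb given above.

```python
import math

# Made-up data: the last student has a low SAT score but a high GPA
sat = [1250, 1300, 1320, 1350, 1380, 1400, 1420, 1450, 1480, 1500, 800]
gpa = [2.9, 3.0, 3.1, 3.1, 3.3, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8]
n, k = len(sat), 1            # k = number of predictors

x_mean, y_mean = sum(sat) / n, sum(gpa) / n
sxx = sum((x - x_mean) ** 2 for x in sat)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(sat, gpa))
b = sxy / sxx
a = y_mean - b * x_mean

resid = [y - (a + b * x) for x, y in zip(sat, gpa)]
ess = sum(e ** 2 for e in resid)
# Leverage: how far each case sits from the mean of X
leverage = [1 / n + (x - x_mean) ** 2 / sxx for x in sat]

studentized, dfits = [], []
for e, h in zip(resid, leverage):
    # Error variance re-estimated with this case left out
    s2_without = (ess - e ** 2 / (1 - h)) / (n - k - 2)
    t = e / math.sqrt(s2_without * (1 - h))
    studentized.append(t)
    # DFITS scales the studentized residual by the case's leverage
    dfits.append(t * math.sqrt(h / (1 - h)))

cutoff = 2 * math.sqrt((k + 1) / n)
flagged = [i for i, d in enumerate(dfits) if abs(d) > cutoff]
print("cutoff:", round(cutoff, 3), " flagged cases:", flagged)
```

The low-SAT student sits far from the other X values, so the same residual counts for much more in DFITS than it does in the studentized residual alone.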