# The Variance Decomposition of Belsley, Kuh, and Welsch (1980)

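A property of eigenvalues is that $\operatorname{tr}(X'X) = \sum_{k=1}^{K} \lambda_k$. This implies that the sizes of the eigenvalues are determined in part by the scaling of the data. Data matrices consisting of large numbers will have larger eigenvalues, in total, than data matrices with small numbers. To remove the effect of scaling, BKW, whose collinearity diagnostic procedure we recommend, suggest scaling the columns of $X$ to unit length. Define $s_j = (x_j'x_j)^{1/2}$, where $x_j$ is the $j$th column of $X$, and let $S = \operatorname{diag}(s_1, s_2, \ldots, s_K)$. Then the scaled $X$ matrix is $XS^{-1}$. This scaling is only for the purpose of diagnosing collinearity, not for model estimation or interpretation.

To diagnose collinearity, examine the proportion of the variance of each least squares coefficient contributed by each individual eigenvalue. Define $\phi_{jk} = c_{jk}^2/\lambda_k$, where $c_{jk}$ is the $j$th element of the eigenvector associated with $\lambda_k$, and let $\phi_j = \sum_{k=1}^{K} \phi_{jk}$, which is the variance of $b_j$ apart from $\sigma^2$. The proportion of the variance of $b_j$ associated with $\lambda_k$ is then $\phi_{jk}/\phi_j$. The condition index associated with $\lambda_k$ is $\eta_k = (\lambda_1/\lambda_k)^{1/2}$, where $\lambda_1$ is the largest eigenvalue. The condition indexes are ordered in magnitude, with $\eta_1 = 1$ and $\eta_K$ being the largest, since its denominator is $\lambda_K$, the smallest eigenvalue. The largest condition index is often called the condition number of $X$.

Table 12.1 summarizes much of what we can learn about collinearity in data. BKW carried out extensive simulations to determine how large condition indexes affect the variances of the least squares estimators. Their diagnostic procedures, also summarized in Belsley (1991, ch. 5), are these: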

Step 1

Begin by identifying large condition indexes. A small eigenvalue and a near exact linear dependency among the columns of $X$ are associated with each large condition index. BKW's experiments lead them to the general guidelines that indexes in the range 0-10 indicate weak near dependencies, 10-30 indicate moderately strong near dependencies, 30-100 indicate a strong near dependency, and indexes in excess of 100 indicate very strong near dependencies. Thus, when examining condition indexes, values of 30 and higher should immediately attract attention.
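Step 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not BKW's own code: it assumes the usual definition of the condition index as the square root of the ratio of the largest eigenvalue to each eigenvalue of the scaled cross-product matrix. The function names `condition_indexes` and `dependency_strength`, the simulated data, and the treatment of the boundary values 10, 30, and 100 are all assumptions made for the example.

```python
import numpy as np

def condition_indexes(X):
    """Condition indexes of X after scaling its columns to unit length."""
    X = np.asarray(X, dtype=float)
    Z = X / np.sqrt((X ** 2).sum(axis=0))      # scale columns to unit length
    lam = np.linalg.eigvalsh(Z.T @ Z)[::-1]    # eigenvalues of Z'Z, largest first
    return np.sqrt(lam[0] / lam)               # ascending; first entry equals 1

def dependency_strength(index):
    """Label a condition index using the guideline ranges quoted above.

    Assumption: a boundary value of 10 or 30 is placed in the stronger band.
    """
    if index < 10:
        return "weak"
    if index < 30:
        return "moderately strong"
    if index <= 100:
        return "strong"
    return "very strong"

# Simulated data (illustrative only): x3 is nearly equal to x1 + x2.
rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=(2, n))
x3 = x1 + x2 + 1e-3 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2, x3])

for eta in condition_indexes(X):
    print(f"{eta:10.1f}  {dependency_strength(eta)}")
```

Because the columns are scaled to unit length first, the indexes do not change when a regressor is measured in different units.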

Step 2 (if there is a single large condition index)

Examine the variance-decomposition proportions. If there is a single large condition index, indicating a single near dependency associated with one small eigenvalue, collinearity adversely affects estimation when two or more coefficients each have 50 percent or more of their variance associated with the large condition index, in the last row of Table 12.1. The variables involved in the near dependency have coefficients with large variance proportions.
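The single-index rule can be illustrated as follows. The sketch assumes the standard decomposition in which the variance of coefficient j, apart from the error variance, is the sum over eigenvalues of the squared eigenvector element divided by the eigenvalue, so each proportion is one term of that sum divided by the total. The names `variance_proportions` and `involved_single` and the simulated data are hypothetical.

```python
import numpy as np

def variance_proportions(X):
    """Variance-decomposition proportions for the column-scaled X.

    Rows follow the condition indexes (smallest first, largest last);
    columns follow the coefficients, and each column sums to one.
    """
    X = np.asarray(X, dtype=float)
    Z = X / np.sqrt((X ** 2).sum(axis=0))      # scale columns to unit length
    lam, C = np.linalg.eigh(Z.T @ Z)           # ascending eigenvalues
    lam, C = lam[::-1], C[:, ::-1]             # reorder: largest eigenvalue first
    phi = C ** 2 / lam                         # phi[j, k] = c_jk^2 / lambda_k
    return (phi / phi.sum(axis=1, keepdims=True)).T

def involved_single(proportions, names, cutoff=0.5):
    """Apply the 50 percent rule to the last row (largest condition index)."""
    last_row = np.asarray(proportions, dtype=float)[-1]
    flagged = [name for name, p in zip(names, last_row) if p >= cutoff]
    # Estimation is judged to be harmed only if two or more coefficients qualify.
    return flagged, len(flagged) >= 2

# Simulated data (illustrative only): x3 is nearly equal to x1 + x2.
rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=(2, n))
x3 = x1 + x2 + 1e-3 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2, x3])
names = ["const", "x1", "x2", "x3"]

props = variance_proportions(X)
flagged, harmful = involved_single(props, names)
print(flagged, harmful)
```

In this simulated design the three variables in the constructed dependency carry nearly all of their coefficient variance in the last row, while the constant does not.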

Step 2 (if there are two or more large condition indexes of relatively equal magnitude)

If there are $J \geq 2$ large and roughly equal condition indexes, then $X'X$ has $J$ eigenvalues that are near zero and $J$ near exact linear dependencies exist among the columns of $X$. Since the $J$ corresponding eigenvectors span the space containing the coefficients of the true linear dependencies, the "50 percent rule" for identifying the variables involved in the near dependencies must be modified.

If there are two (or more) small eigenvalues, then we have two (or more) near exact linear relations, such as $Xc_i \approx 0$ and $Xc_j \approx 0$. These two relationships do not necessarily indicate the form of the linear dependencies, since $X(a_1c_i + a_2c_j) \approx 0$ as well. In this case the two vectors of constants $c_i$ and $c_j$ define a two-dimensional vector space in which the two near exact linear dependencies exist. While we may not be able to identify the individual relationships among the explanatory variables that are causing the collinearity, we can identify the variables that appear in the two (or more) relations.
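A short worked equation may help show why any combination of the two eigenvectors is also a near dependency. It assumes, as holds for any symmetric matrix, that $c_i$ and $c_j$ are orthonormal eigenvectors of $X'X$, with eigenvalues written here as $\lambda_i$ and $\lambda_j$:

$$
\|X(a_1c_i + a_2c_j)\|^2 = a_1^2\,c_i'X'Xc_i + 2a_1a_2\,c_i'X'Xc_j + a_2^2\,c_j'X'Xc_j = a_1^2\lambda_i + a_2^2\lambda_j .
$$

The cross term vanishes because $c_i'X'Xc_j = \lambda_j c_i'c_j = 0$. With $a_1^2 + a_2^2 = 1$, the squared length is a weighted average of two small eigenvalues and is therefore itself small, so the data cannot single out one particular pair of relations within that two-dimensional space.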

Thus variance proportions in a single row do not identify specific linear dependencies, as they did when there was but one large condition index. In this case, sum the variance proportions across the $J$ large condition index rows in Table 12.1. The variables involved in the (set of) near linear dependencies are identified by summed coefficient variance proportions of greater than 50 percent.
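The summing rule is easy to apply once the table of proportions is available. The sketch below uses a made-up table, not values from Table 12.1, with two large condition indexes in the last two rows. The function name `involved_multiple` and the variable names are hypothetical.

```python
import numpy as np

def involved_multiple(proportions, names, J, cutoff=0.5):
    """Sum the last J rows of the table and flag sums greater than cutoff."""
    summed = np.asarray(proportions, dtype=float)[-J:].sum(axis=0)
    flagged = [name for name, s in zip(names, summed) if s > cutoff]
    return flagged, summed

# Made-up proportions: rows = condition indexes (smallest first),
# columns = coefficients; every column sums to one.
names = ["const", "x1", "x2", "x3", "x4"]
table = np.array([
    [0.02, 0.00, 0.01, 0.00, 0.01],
    [0.85, 0.01, 0.02, 0.01, 0.02],
    [0.08, 0.02, 0.03, 0.02, 0.75],
    [0.03, 0.45, 0.40, 0.52, 0.12],   # first large condition index
    [0.02, 0.52, 0.54, 0.45, 0.10],   # second large condition index
])

involved, summed = involved_multiple(table, names, J=2)
print(involved)
print(np.round(summed, 2))
```

Read row by row, neither of the last two rows identifies all three variables, but the column sums over both rows do.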

Step 2 (if there are $J \geq 2$ large condition indexes, with one extremely large)

An extremely large condition index, arising from a very small eigenvalue, can "mask" the variables involved in other near exact linear dependencies. For example, if one condition index is 500 and another is 50, then there are two near exact linear dependencies among the columns of $X$. However, the variance decompositions associated with the condition index of 50 may not indicate that there are two or more variables involved in a relationship. Identify the variables involved in the set of near linear dependencies by summing the coefficient variance proportions in the last $J$ rows of Table 12.1, and locating the sums greater than 50 percent.

Step 3

Perhaps the most important step in the diagnostic process is determining which coefficients are not affected by collinearity. If there is a single large condition index, coefficients with variance proportions less than 50 percent in the last row of Table 12.1 are not adversely affected by the collinear relationship in the data. If there are $J \geq 2$ large condition indexes, then sum the last $J$ rows of variance proportions. Coefficients with summed variance proportions of less than 50 percent are not adversely affected by the collinear relationships. If the parameters of interest have coefficients unaffected by collinearity, then small eigenvalues and large condition numbers are not a problem.
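Step 3 can be sketched by selecting the rows whose condition index is large and checking which column sums stay below 50 percent. The threshold of 30 follows the Step 1 guideline; the function name `unaffected`, the condition indexes, and the table of proportions are made up for illustration.

```python
import numpy as np

def unaffected(cond_idx, proportions, names, large=30.0, cutoff=0.5):
    """Coefficients whose summed proportions over the large-index rows are < cutoff."""
    cond_idx = np.asarray(cond_idx, dtype=float)
    proportions = np.asarray(proportions, dtype=float)
    mask = cond_idx >= large                   # rows with a large condition index
    summed = proportions[mask].sum(axis=0)     # all zeros if no row qualifies
    return [name for name, s in zip(names, summed) if s < cutoff]

# Made-up example: two large condition indexes, one of them extremely large.
names = ["const", "x1", "x2", "x3"]
cond_idx = [1.0, 6.0, 48.0, 510.0]
table = np.array([
    [0.05, 0.00, 0.00, 0.01],
    [0.80, 0.01, 0.00, 0.69],
    [0.10, 0.04, 0.03, 0.25],
    [0.05, 0.95, 0.97, 0.05],
])

print(unaffected(cond_idx, table, names))
```

If the parameters of interest are among the names returned, the large condition indexes need not be a concern for them.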

Step 4

If key parameter estimates are adversely affected by collinearity, further diagnostic steps may be taken. If there is a single large condition index, the variance proportions identify the variables involved in the near dependency. If there are multiple large condition indexes, auxiliary regressions may be used to further study the nature of the relationships between the columns of $X$. In these regressions one variable in a near dependency is regressed upon the other variables in the identified set. The usual t-statistics may be used as diagnostic tools to determine which variables are involved in specific linear dependencies. See Belsley (1991, p. 144) for suggestions. Unfortunately, these auxiliary regressions may also be confounded by collinearity, and thus they may not be informative.
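An auxiliary regression of this kind can be sketched with ordinary least squares in NumPy. The function name `auxiliary_regression`, the column layout, and the simulated data are assumptions for illustration; the t-statistics are the usual ratios of coefficient to standard error.

```python
import numpy as np

def auxiliary_regression(X, target, others):
    """Regress column `target` of X on the columns listed in `others`.

    Returns the least squares coefficients and their usual t-statistics.
    """
    X = np.asarray(X, dtype=float)
    y, Z = X[:, target], X[:, others]
    n, k = Z.shape
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    sigma2 = resid @ resid / (n - k)           # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Z.T @ Z)))
    return coef, coef / se

# Simulated data (illustrative only): x3 is nearly x1 + x2; x4 is unrelated.
rng = np.random.default_rng(0)
n = 200
x1, x2, x4 = rng.normal(size=(3, n))
x3 = x1 + x2 + 0.01 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2, x3, x4])

# Regress x3 (column 3) on the constant, x1, x2, and x4.
coef, t_stats = auxiliary_regression(X, target=3, others=[0, 1, 2, 4])
print(np.round(coef, 3))
print(np.round(t_stats, 1))
```

Here the very large t-statistics on x1 and x2 point to them as partners of x3 in the dependency. With real data the regressors of the auxiliary regression may themselves be collinear, in which case the t-statistics can be uninformative, as noted above.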