Calculate VIF for Categorical Variable
Variance Inflation Factor (VIF) is a commonly used method for detecting multicollinearity in regression models. VIF is generally calculated for the continuous variables.
When there are categorical variables in the dataset, the VIF calculation can be tricky, and we may need to consider additional metrics such as Generalized Variance Inflation Factor (GVIF) for evaluating the multicollinearity for categorical variables.
GVIF is an extension of VIF and is generally used for detecting the multicollinearity in regression models which include categorical variables with more than two levels.
The adjusted GVIF is calculated as GVIF^(1/(2*df)), where df is the number of degrees of freedom (that is, the number of coefficients) associated with the individual predictor. This adjustment makes the values comparable across predictors with different degrees of freedom, which is why GVIF is suitable for regression models with categorical variables. It is most useful when dealing with non-binary categorical data.
The following example demonstrates how to calculate the GVIF in R for categorical variables to detect multicollinearity.
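As a quick numeric illustration of the degrees-of-freedom adjustment, here is a minimal sketch in base R. The GVIF value of 4 is hypothetical and chosen only to make the arithmetic easy to follow; it does not come from any dataset.

```r
# Hypothetical GVIF value, chosen only for illustration
gvif <- 4

# Adjusted GVIF = GVIF^(1 / (2 * df))
adj_df1 <- gvif^(1 / (2 * 1))  # df = 1 (numeric or binary predictor): square root
adj_df2 <- gvif^(1 / (2 * 2))  # df = 2 (three-level factor): fourth root

print(c(df1 = adj_df1, df2 = adj_df2))
```

The same raw GVIF is shrunk more strongly for the predictor that uses more coefficients, which is what lets terms with different degrees of freedom be compared on one scale.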
Input dataset
We will use the Duncan dataset from the carData package in R. This dataset contains the prestige and other attributes of various occupations in the US.
# import package
library(carData)
# load data
data("Duncan")
# view data
head(Duncan)
type income education prestige
accountant prof 62 86 82
pilot prof 72 76 83
architect prof 75 92 90
author prof 55 90 76
chemist prof 64 86 90
minister prof 21 84 87
The Duncan dataset has both numerical (income and education) and categorical (type) predictors. The prestige variable is the response variable.
Get the levels of the categorical (type) predictor and the number of observations in each level:
summary(Duncan$type)
bc prof wc
21 18 6
The categorical variable has three levels.
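The number of levels determines how many degrees of freedom the factor uses in the regression. A minimal sketch, assuming the default treatment (dummy) coding in R, where a factor with k levels contributes k - 1 coefficients (the variable names n_levels and df_type are only illustrative):

```r
library(carData)
data("Duncan")

# Number of levels in the factor
n_levels <- nlevels(Duncan$type)

# Under default treatment coding, a k-level factor uses k - 1 coefficients
df_type <- n_levels - 1

print(levels(Duncan$type))
print(c(levels = n_levels, df = df_type))
```

This df value is the one that appears in the Df column of the GVIF table and in the exponent of the adjusted GVIF.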
Fit the linear regression
Fit the linear regression model with both the categorical (type) and numerical (income and education) variables as predictors. Use prestige as the response variable.
model <- lm(prestige ~ type + income + education, data = Duncan)
Get the model summary:
summary(model)
# output
Call:
lm(formula = prestige ~ type + income + education, data = Duncan)
Residuals:
Min 1Q Median 3Q Max
-14.890 -5.740 -1.754 5.442 28.972
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.18503 3.71377 -0.050 0.96051
typeprof 16.65751 6.99301 2.382 0.02206 *
typewc -14.66113 6.10877 -2.400 0.02114 *
income 0.59755 0.08936 6.687 5.12e-08 ***
education 0.34532 0.11361 3.040 0.00416 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.744 on 40 degrees of freedom
Multiple R-squared: 0.9131, Adjusted R-squared: 0.9044
F-statistic: 105 on 4 and 40 DF, p-value: < 2.2e-16
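The summary shows two coefficients for type (typeprof and typewc) because the three-level factor is expanded into two dummy columns. The following sketch inspects the design matrix to make that mapping explicit; it relies only on base R's model.matrix and its "assign" attribute.

```r
library(carData)
data("Duncan")
model <- lm(prestige ~ type + income + education, data = Duncan)

# Design matrix actually used by lm()
X <- model.matrix(model)
print(colnames(X))

# "assign" maps each column to its model term:
# 0 = intercept, 1 = type, 2 = income, 3 = education
term_index <- attr(X, "assign")
print(term_index)
```

Because type occupies two columns, a single per-coefficient VIF does not describe it well, which is the motivation for computing one GVIF per term.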
Calculate GVIF
You can use the vif function from the car package to calculate the GVIF values for the predictor variables in the model.
# import package
library(car)
vif(model)
# output
GVIF Df GVIF^(1/(2*Df))
type 5.098592 2 1.502666
income 2.209178 1 1.486330
education 5.297584 1 2.301648
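The vif function also outputs adjusted GVIF values, GVIF^(1/(2*Df)), in addition to GVIF. The adjusted GVIF values are corrected for the degrees of freedom, so they can be compared across predictors. For a predictor with one degree of freedom, GVIF is the same as the usual VIF and the adjusted value is its square root.
As a rule of thumb, high adjusted GVIF values (greater than about 2, which corresponds to a VIF greater than about 4 for a predictor with one degree of freedom) indicate the presence of moderate to strong multicollinearity. Please refer to this article to see the VIF scale for detecting multicollinearity.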
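To see where these numbers come from, the sketch below recomputes GVIF and the adjusted GVIF by hand using only base R. It follows the determinant-based definition of GVIF, applied to the correlation matrix of the coefficient estimates. The helper names (R, term_index, gvif_manual) are illustrative, and this is meant as a way to understand the calculation rather than a replacement for car::vif.

```r
library(carData)
data("Duncan")
model <- lm(prestige ~ type + income + education, data = Duncan)

# Correlation matrix of the coefficient estimates, intercept removed
R <- cov2cor(vcov(model)[-1, -1])

# Map each remaining coefficient to its model term (1 = type, 2 = income, 3 = education)
term_index <- attr(model.matrix(model), "assign")[-1]
term_labels <- attr(terms(model), "term.labels")

# GVIF for a term = det(R11) * det(R22) / det(R), where R11 holds the
# term's own coefficients and R22 holds all remaining coefficients
gvif_manual <- sapply(seq_along(term_labels), function(i) {
  idx <- which(term_index == i)
  det(R[idx, idx, drop = FALSE]) * det(R[-idx, -idx, drop = FALSE]) / det(R)
})
names(gvif_manual) <- term_labels

# Degrees of freedom per term and the adjusted GVIF
df <- sapply(seq_along(term_labels), function(i) sum(term_index == i))
adj_gvif <- gvif_manual^(1 / (2 * df))

print(cbind(GVIF = gvif_manual, Df = df, Adjusted = adj_gvif))
```

The printed table should agree with the vif(model) output shown above, up to rounding.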
The adjusted GVIF value for education is high (2.30), so you may consider removing this variable from the model.
Doing so can reduce multicollinearity and improve the reliability of your regression coefficients.
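One way to check the effect of dropping education is to refit the model without it and recompute the GVIF values. The sketch below uses update from base R together with vif from car; the names model_reduced and gvif_reduced are illustrative. Whether removing a predictor is appropriate also depends on the purpose of the analysis, so treat this as a diagnostic step rather than a fixed rule.

```r
library(carData)
library(car)
data("Duncan")

# Full model from the example above
model <- lm(prestige ~ type + income + education, data = Duncan)

# Refit without education, keeping all other terms unchanged
model_reduced <- update(model, . ~ . - education)

# Recompute GVIF for the remaining predictors
gvif_reduced <- vif(model_reduced)
print(gvif_reduced)
```

Comparing the adjusted GVIF column before and after the refit shows whether the remaining predictors are now less collinear.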