Calculate VIF for Categorical Variable

2024-05-29

Variance Inflation Factor (VIF) is a commonly used measure for detecting multicollinearity in regression models. VIF is generally calculated for continuous predictors.

When the dataset contains categorical variables, the VIF calculation is less straightforward, because a categorical variable with several levels enters the model as multiple dummy columns. In that case, the Generalized Variance Inflation Factor (GVIF) can be used to evaluate multicollinearity.

GVIF is an extension of VIF and is generally used for detecting multicollinearity in regression models that include categorical variables with more than two levels.

Tip

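GVIF generalizes VIF to predictors that use more than one degree of freedom (df), such as a categorical variable coded as several dummy columns. To make values comparable across predictors with different df, GVIF is raised to the power 1/(2*df), which gives the adjusted GVIF, GVIF^(1/(2*df)). For a predictor with one df, GVIF equals the ordinary VIF, so the adjusted GVIF is the square root of the VIF. The adjustment is most useful for non-binary categorical predictors.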
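For reference, the definition usually attributed to Fox and Monette (1992) can be written as follows. The symbols R, R_11, and R_22 are notation introduced here for illustration: R is the correlation matrix of all predictor columns, R_11 is the block for the columns of predictor j, and R_22 is the block for the remaining columns.

```latex
\mathrm{GVIF}_j = \frac{\det(R_{11})\,\det(R_{22})}{\det(R)},
\qquad
\text{adjusted GVIF}_j = \mathrm{GVIF}_j^{\,1/(2\,\mathrm{df}_j)}
```

When predictor j has a single column, det(R_11) = 1 and the expression reduces to the ordinary VIF.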

The following example demonstrates how to calculate the GVIF in R to detect multicollinearity in a model with a categorical variable.

Input dataset

We will use the Duncan dataset from the carData package in R. This dataset contains the prestige and other attributes of 45 US occupations.

# import package
library(carData)

# load data
data("Duncan")

# view data
head(Duncan)

           type income education prestige
accountant prof     62        86       82
pilot      prof     72        76       83
architect  prof     75        92       90
author     prof     55        90       76
chemist    prof     64        86       90
minister   prof     21        84       87

The Duncan dataset has both numerical (income and education) and categorical (type) predictors. The response variable is prestige.

Get the number of observations at each level of the categorical (type) predictor:

summary(Duncan$type)
  bc prof   wc 
  21   18    6

The categorical variable has three levels.
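Three levels matter here because a factor with k levels enters the model as k - 1 dummy columns, and that count is the Df that vif later reports for the term. The snippet below is a minimal sketch of how this could be checked, assuming R's default treatment coding; the variable name dummies is introduced only for illustration.

```r
library(carData)
data("Duncan")

# Levels of the factor; under default treatment coding the first level
# is the reference category
levels(Duncan$type)

# Dummy columns that R creates for `type` (intercept column dropped)
dummies <- model.matrix(~ type, data = Duncan)[, -1, drop = FALSE]
colnames(dummies)

# Number of dummy columns = number of levels - 1 = Df of the term
ncol(dummies)
nlevels(Duncan$type) - 1
```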

Fit the linear regression

Fit the linear regression model with both categorical (type) and numerical variables (income and education) in the model.

Use prestige as a response variable.

model <- lm(prestige ~ type + income + education, data = Duncan)

Get the model summary:

summary(model)

# output
Call:
lm(formula = prestige ~ type + income + education, data = Duncan)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.890  -5.740  -1.754   5.442  28.972 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.18503    3.71377  -0.050  0.96051    
typeprof     16.65751    6.99301   2.382  0.02206 *  
typewc      -14.66113    6.10877  -2.400  0.02114 *  
income        0.59755    0.08936   6.687 5.12e-08 ***
education     0.34532    0.11361   3.040  0.00416 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.744 on 40 degrees of freedom
Multiple R-squared:  0.9131,	Adjusted R-squared:  0.9044 
F-statistic:   105 on 4 and 40 DF,  p-value: < 2.2e-16

Calculate GVIF

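You can use the vif function from the car package to calculate the GVIF values for the predictor variables in the model. When any model term has more than one degree of freedom, vif reports GVIF rather than the ordinary VIF.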

# import package 
library(car)

vif(model)

# output

              GVIF Df GVIF^(1/(2*Df))
type      5.098592  2        1.502666
income    2.209178  1        1.486330
education 5.297584  1        2.301648
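To see where the GVIF column comes from, the determinant-based definition can be sketched in base R. This is an illustrative reimplementation, not the code used inside the car package; gvif_manual is a hypothetical helper name, and the sketch assumes an lm model that includes an intercept.

```r
# Sketch: GVIF per model term from determinants of predictor correlation
# matrices. Assumes an lm-style model with an intercept.
gvif_manual <- function(model) {
  X <- model.matrix(model)
  term_id <- attr(X, "assign")            # column -> term index (0 = intercept)
  X <- X[, term_id != 0, drop = FALSE]    # drop the intercept column
  term_id <- term_id[term_id != 0]
  labels <- attr(terms(model), "term.labels")

  R <- cor(X)                             # correlations among all predictor columns
  det_R <- det(R)

  gvif <- vapply(seq_along(labels), function(j) {
    cols <- which(term_id == j)           # columns that belong to term j
    det(R[cols, cols, drop = FALSE]) *
      det(R[-cols, -cols, drop = FALSE]) / det_R
  }, numeric(1))

  # Df of a term = number of design-matrix columns it occupies
  df <- as.vector(table(factor(term_id, levels = seq_along(labels))))

  out <- cbind(GVIF = gvif, Df = df, adjusted = gvif^(1 / (2 * df)))
  rownames(out) <- labels
  out
}
```

Calling gvif_manual(model) on the model fitted earlier should agree with the vif(model) output up to floating-point error.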

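In addition to GVIF, the vif function reports the adjusted value GVIF^(1/(2*Df)), which corrects for the degrees of freedom of each term so that predictors with different Df can be compared. For a term with one degree of freedom, the adjusted value is the square root of the usual VIF.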
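The adjustment can be checked by hand from the reported GVIF and Df values. The sketch below copies those values from the output above; adjusted_gvif is a hypothetical helper name introduced for illustration.

```r
# Adjusted GVIF: GVIF raised to the power 1 / (2 * Df)
adjusted_gvif <- function(gvif, df) gvif^(1 / (2 * df))

# GVIF and Df values copied from the vif(model) output above
gvif <- c(type = 5.098592, income = 2.209178, education = 5.297584)
df   <- c(type = 2, income = 1, education = 1)

adj <- adjusted_gvif(gvif, df)

# Should match the third column of the vif(model) output
round(adj, 4)

# Squaring puts the values back on a VIF-like scale; for the one-df
# terms this simply recovers the GVIF column
round(adj^2, 3)
```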

A high adjusted GVIF value, for example above 2 (which corresponds to a VIF above 4 for a predictor with one degree of freedom), indicates moderate to strong multicollinearity. Please refer to this article for the VIF scale used to detect multicollinearity.

The adjusted GVIF value for education is high (2.30), so you may consider removing this variable from the model.

This will help to reduce multicollinearity and improve the reliability of your regression coefficients.
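One way to act on this is to refit the model without education and recompute the GVIF values. The snippet below is a sketch of that step; the name reduced is introduced here for illustration, and comparing adjusted R-squared is just one possible check of what the removal costs in fit.

```r
library(car)
library(carData)
data("Duncan")

# Full model from the example above
model <- lm(prestige ~ type + income + education, data = Duncan)

# Refit without education
reduced <- update(model, . ~ . - education)

# GVIF values for the remaining predictors
vif(reduced)

# Compare how much explanatory power is lost by dropping the variable
summary(model)$adj.r.squared
summary(reduced)$adj.r.squared
```

Because education is a significant predictor in the full model, it is worth weighing the drop in fit against the gain in coefficient stability before settling on the reduced model.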