Thresholds for Detecting Multicollinearity
Multicollinearity occurs when two or more predictors (independent variables) in a regression analysis are highly correlated.
Multicollinearity is problematic in machine learning (ML) because it inflates the standard errors of the regression coefficients, which can make predictors appear less statistically significant than they are. As a result, the fitted model may not be reliable.
Multicollinearity can be detected using various methods such as variance inflation factor (VIF), tolerance, correlation, and condition index.
Each of these methods has its own threshold for deciding whether multicollinearity is a problem.
This article explains how to set the thresholds for detecting multicollinearity.
Variance inflation factor (VIF)
Variance inflation factor (VIF) is a commonly used method for detecting multicollinearity in regression models.
VIF measures how much the variance of an estimated regression coefficient is inflated because of multicollinearity among the predictors.
The following rule-of-thumb thresholds are commonly used to interpret VIF.
VIF thresholds | Multicollinearity |
---|---|
VIF = 1 | No multicollinearity |
VIF between 1 and 5 | Low to moderate multicollinearity |
VIF > 5 | Strong multicollinearity |
VIF > 10 | Very strong multicollinearity |
If a predictor has VIF > 5, it indicates a problematic level of multicollinearity in the regression model.
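To make the thresholds concrete, here is a minimal sketch of computing VIF with NumPy. It relies on the identity that the VIF of predictor j equals the j-th diagonal element of the inverse of the predictor correlation matrix, which is the same as 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The `vif` helper and the simulated predictors `x1`, `x2`, and `x3` are illustrative assumptions, not part of any library.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    # Correlation matrix of the predictors
    corr = np.corrcoef(X, rowvar=False)
    # Diagonal of the inverse correlation matrix equals 1 / (1 - R_j^2)
    return np.diag(np.linalg.inv(corr))

# Simulated data (assumption): x3 is almost a linear combination of x1 and x2
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
for name, value in zip(["x1", "x2", "x3"], vifs):
    flag = "problematic" if value > 5 else "ok"
    print(f"{name}: VIF = {value:.1f} ({flag})")
```

In this simulated setup all three predictors should land well above the VIF > 5 cutoff, because each one can be almost fully reconstructed from the other two.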
Tolerance (T)
Tolerance (T) is the reciprocal of the VIF (T = 1/VIF) and is another method to detect multicollinearity in regression models.
The following rule-of-thumb thresholds are commonly used to interpret tolerance.
T thresholds | Multicollinearity |
---|---|
T = 1 | No multicollinearity |
T < 0.25 | Strong multicollinearity |
If a predictor has a tolerance < 0.25, it indicates a problematic level of multicollinearity in the regression model. Because T = 1/VIF, this cutoff corresponds to VIF > 4, which is slightly stricter than the VIF > 5 rule above.
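Tolerance can also be computed directly as 1 - R_j^2, where R_j^2 is obtained by regressing predictor j on the remaining predictors. The sketch below does this with ordinary least squares via `np.linalg.lstsq`; the `tolerance` helper and the simulated data are assumptions made for illustration.

```python
import numpy as np

def tolerance(X):
    """Return the tolerance (1 - R_j^2) of each column of X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    tol = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress predictor j on the other predictors plus an intercept
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        # Unexplained share of the variance of predictor j
        tol[j] = (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return tol

# Simulated data (assumption): x3 is almost a linear combination of x1 and x2
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2, x3])

tol = tolerance(X)
for name, value in zip(["x1", "x2", "x3"], tol):
    flag = "problematic" if value < 0.25 else "ok"
    print(f"{name}: T = {value:.4f}, 1/T = {1 / value:.1f} ({flag})")
```

Printing 1/T next to T makes the reciprocal relationship with VIF visible for each predictor.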
Condition index (CI)
The condition index (CI) is another method for detecting multicollinearity in regression models.
The condition indices are calculated from the eigenvalues of the scaled predictor matrix, where each eigenvalue is the variance of a linear combination of the scaled predictors. Each condition index is the square root of the ratio of the largest eigenvalue to one of the eigenvalues, so a very small eigenvalue produces a large index.
The following rule-of-thumb thresholds are commonly used to interpret the condition index.
CI thresholds | Multicollinearity |
---|---|
CI < 10 | No multicollinearity |
CI between 10 and 30 | Low to moderate multicollinearity |
CI > 30 | Strong multicollinearity |
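
As a rough illustration of the calculation, the sketch below computes condition indices from the eigenvalues of the predictor correlation matrix. Using the correlation matrix (centered, unit-variance predictors) is an assumption of this example; some implementations instead scale the uncentered columns, including the intercept, to unit length, and the resulting indices will differ. The `condition_indices` helper and the simulated data are hypothetical.

```python
import numpy as np

def condition_indices(X):
    """Return condition indices sqrt(lambda_max / lambda_i), sorted ascending."""
    X = np.asarray(X, dtype=float)
    # Correlation matrix = cross-product of centered, unit-variance predictors
    corr = np.corrcoef(X, rowvar=False)
    # eigvalsh is intended for symmetric matrices; eigenvalues come back ascending
    eigvals = np.linalg.eigvalsh(corr)
    return np.sort(np.sqrt(eigvals.max() / eigvals))

# Simulated data (assumption): x3 is almost a linear combination of x1 and x2
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + rng.normal(scale=0.05, size=200)
X = np.column_stack([x1, x2, x3])

ci = condition_indices(X)
for value in ci:
    if value > 30:
        level = "strong"
    elif value >= 10:
        level = "low to moderate"
    else:
        level = "none"
    print(f"CI = {value:.1f} ({level})")
```

The smallest index is always 1, and the largest one is usually the value compared against the thresholds above.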
Correlation analysis
You can also perform a pairwise correlation analysis (correlation matrix) among all the predictor variables in the regression model.
High correlation values (close to 1 or -1) between predictor variables indicate the presence of strong multicollinearity.
As a rule of thumb, a pairwise correlation above 0.8 or below -0.8 between two predictors indicates strong multicollinearity.
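A minimal sketch of this check with NumPy is shown below. It builds the correlation matrix with `np.corrcoef` and lists every pair of predictors whose absolute correlation exceeds 0.8. The variable names and the simulated data are assumptions chosen for illustration.

```python
import numpy as np

# Simulated data (assumption): x2 is a noisy copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.3, size=200)
x3 = rng.normal(size=200)
names = ["x1", "x2", "x3"]

# Pairwise Pearson correlations between the predictor columns
corr = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)

threshold = 0.8
# Scan the upper triangle so each pair is reported once
flagged = [
    (names[i], names[j], corr[i, j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]

for a, b, r in flagged:
    print(f"{a} vs {b}: r = {r:.2f}")
```

Comparing the absolute value against the threshold covers both the > 0.8 and the < -0.8 case in a single test.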