Calculate Z-score for Columns in pandas DataFrame
Z-score (also known as standard score) is a statistical measure of how many standard deviations a data point lies from the mean of the data distribution.
In a pandas DataFrame, you can calculate the Z-score for one or all columns using the zscore function from the SciPy Python package or manually with the Z-score formula.
The following example demonstrates how to calculate the Z-score for all numeric columns in a pandas DataFrame.
Using zscore function
Create a random pandas DataFrame:
# import package
import pandas as pd
import numpy as np
# set random seed for reproducibility
np.random.seed(42)
# create random dataset
df = pd.DataFrame({'col1': np.random.rand(5),
'col2': np.random.randint(1, 10, 5),
'col3': np.random.randn(5)
})
# view DataFrame
df
col1 col2 col3
0 0.374540 3 0.392580
1 0.950714 7 -0.929185
2 0.731994 8 0.079832
3 0.598658 5 -0.159517
4 0.156019 4 0.022222
Now, calculate the Z-score for each column of the DataFrame. We will use the zscore function from the SciPy Python package.
# import package
from scipy.stats import zscore
# calculate z score for each column
df_zscore = df.apply(zscore)
# view DataFrame
df_zscore
col1 col2 col3
0 -0.680221 -1.293993 1.155573
1 1.406211 0.862662 -1.831159
2 0.614185 1.401826 0.448870
3 0.131353 -0.215666 -0.091975
4 -1.471527 -0.754829 0.318691
The pandas apply function applies the zscore function from SciPy to each column of the DataFrame.
The resulting DataFrame shows the calculated Z-score for all columns.
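Real datasets often mix numeric columns with text columns or contain missing values. The following is a minimal sketch of one way to prepare such a DataFrame before applying zscore. The DataFrame df_mixed and its column names are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# hypothetical DataFrame with a missing value and a categorical column
df_mixed = pd.DataFrame({
    'col1': [1.0, 2.0, np.nan, 4.0],
    'col2': [10, 20, 30, 40],
    'group': ['a', 'b', 'a', 'b'],
})

# keep numeric columns only, then drop rows that contain missing values
df_numeric = df_mixed.select_dtypes(include='number').dropna()

# calculate z score for each remaining column
df_mixed_zscore = df_numeric.apply(zscore)
```

Dropping rows is only one option. Depending on the analysis, imputing the missing values first may be more appropriate.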
If the DataFrame contains missing values (NaN) or categorical columns, you should drop them before applying the zscore function.
Using Z-score formula
You can also manually calculate the Z-score for each column in the pandas DataFrame using the Z-score formula.
The Z-score is calculated as:
$$z = \frac{x - \mu}{\sigma}$$
Where x is a data point, µ is the mean of the data, and σ is the standard deviation of the data.
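As a small worked example of the Z-score calculation, the sketch below uses only the Python standard library on a hypothetical list of eight values whose mean is 5 and whose population standard deviation is 2.

```python
import statistics

# hypothetical sample values (mean = 5, population standard deviation = 2)
values = [2, 4, 4, 4, 5, 5, 7, 9]

mu = statistics.mean(values)       # mean of the data
sigma = statistics.pstdev(values)  # population standard deviation

# Z-score of each data point: (x - mu) / sigma
z_scores = [(x - mu) / sigma for x in values]
print(z_scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

The value 9 is 4 units above the mean, which is 2 standard deviations, so its Z-score is 2.0.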
You can calculate the Z-score for each column using the above formula as below:
# import package
import pandas as pd
import numpy as np
# set random seed for reproducibility
np.random.seed(42)
# create random dataset
df = pd.DataFrame({'col1': np.random.rand(5),
'col2': np.random.randint(1, 10, 5),
'col3': np.random.randn(5)
})
# calculate the Z-score for each column
df_zscore = (df - df.mean()) / df.std(ddof=0)
# view DataFrame
df_zscore
col1 col2 col3
0 -0.680221 -1.293993 1.155573
1 1.406211 0.862662 -1.831159
2 0.614185 1.401826 0.448870
3 0.131353 -0.215666 -0.091975
4 -1.471527 -0.754829 0.318691
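In the above code, ddof=0 tells pandas to compute the population standard deviation. This matches the default of the SciPy zscore function, whereas the pandas std function defaults to ddof=1 (sample standard deviation).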
The resulting DataFrame shows the calculated Z-score for all columns.
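To illustrate why the ddof setting matters, the sketch below compares Z-scores computed with the population standard deviation (ddof=0) and with the pandas default sample standard deviation (ddof=1). The Series s holds hypothetical values chosen so that the population standard deviation is exactly 2.

```python
import pandas as pd

# hypothetical values (mean = 5, population standard deviation = 2)
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# Z-scores with the population standard deviation
z_population = (s - s.mean()) / s.std(ddof=0)

# Z-scores with the pandas default (ddof=1, sample standard deviation)
z_sample = (s - s.mean()) / s.std()

# side-by-side comparison of both versions
comparison = pd.DataFrame({'ddof=0': z_population, 'ddof=1': z_sample})
print(comparison)
```

Because the sample standard deviation is slightly larger, the ddof=1 Z-scores are slightly smaller in magnitude. The difference shrinks as the number of rows grows.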