
Calculate Z-score for Columns in pandas DataFrame

Z-score (also known as standard score) is a statistical measure of how many standard deviations a data point is from the mean of the data distribution.

In a pandas DataFrame, you can calculate the Z-score for one or all columns using the zscore function from the SciPy Python package or manually using the Z-score formula.

The following example demonstrates how to calculate the Z-score for all numeric columns in a pandas DataFrame.

Using zscore function

Create a random pandas DataFrame:

# import packages
import pandas as pd
import numpy as np

# set random seed for reproducibility (seed 42 reproduces the col1 values below)
np.random.seed(42)

# create random dataset
df = pd.DataFrame({'col1': np.random.rand(5),
                   'col2': np.random.randint(1, 10, 5),
                   'col3': np.random.randn(5)})

# view DataFrame
print(df)

       col1  col2      col3
0  0.374540     3  0.392580
1  0.950714     7 -0.929185
2  0.731994     8  0.079832
3  0.598658     5 -0.159517
4  0.156019     4  0.022222

Now, calculate the Z-score for each column of the DataFrame. We will use the zscore function from the SciPy Python package.

# import package
from scipy.stats import zscore

# calculate z score for each column
df_zscore = df.apply(zscore)

# view DataFrame
print(df_zscore)

      col1      col2      col3
0 -0.680221 -1.293993  1.155573
1  1.406211  0.862662 -1.831159
2  0.614185  1.401826  0.448870
3  0.131353 -0.215666 -0.091975
4 -1.471527 -0.754829  0.318691

The pandas apply function applies the zscore function from SciPy to each column of the DataFrame.

The resulting DataFrame shows the calculated Z-score for all columns.
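To calculate the Z-score for a single column instead of all columns, you can pass just that column to zscore. The following is a minimal sketch; it reuses the col2 values shown above, and the label column and the col2_zscore column name are illustrative assumptions.

```python
import pandas as pd
from scipy.stats import zscore

# Small DataFrame reusing the col2 values from the example, plus a text column
df = pd.DataFrame({'col2': [3, 7, 8, 5, 4],
                   'label': list('abcde')})

# Z-score for one column only, stored as a new column (name is illustrative)
df['col2_zscore'] = zscore(df['col2'])

print(df)
```

The other columns are left untouched, so this approach also works when the DataFrame contains non-numeric columns.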

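If the DataFrame contains missing values (NaN) or categorical columns, drop them before applying the zscore function. By default, zscore returns NaN for every value in a column that contains a NaN, and it cannot be applied to non-numeric columns.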
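One possible way to prepare such a DataFrame is sketched below. The data and the group column are hypothetical; select_dtypes keeps only the numeric columns and dropna removes rows with missing values.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical DataFrame with a missing value and a text column
df = pd.DataFrame({'col1': [1.0, 2.0, np.nan, 4.0, 5.0],
                   'col2': [3, 7, 8, 5, 4],
                   'group': ['a', 'b', 'a', 'b', 'a']})

# keep numeric columns only, then drop rows that contain NaN
df_clean = df.select_dtypes(include='number').dropna()

# calculate z score for each remaining column
df_zscore = df_clean.apply(zscore)

print(df_zscore)
```

Dropping rows removes the whole observation, including its values in other columns. If that is not desirable, the nan_policy='omit' option of zscore may be an alternative, as it ignores NaN values when computing the mean and standard deviation.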

Using Z-score formula

You can also manually calculate the Z-score for each column in the pandas DataFrame using the Z-score formula.

The Z-score is calculated as:

$$Z = \frac{x - \mu}{\sigma}$$

where x is a data point, µ is the mean of the data, and σ is the standard deviation of the data.
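As a small worked example of the formula, the sketch below computes the Z-score of one data point by hand. The values are hypothetical and chosen so that the arithmetic is easy to follow (mean 5, population standard deviation 2).

```python
import numpy as np

# Hypothetical values chosen only for illustration
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = values.mean()     # mean of the data: 40 / 8 = 5
sigma = values.std()   # population standard deviation (NumPy default ddof=0): sqrt(32 / 8) = 2

# Z-score of a single data point x
x = 9.0
z = (x - mu) / sigma   # (9 - 5) / 2 = 2

print(mu, sigma, z)
```

A Z-score of 2 means the data point lies two standard deviations above the mean.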

You can calculate the Z-score for each column using the above formula as below:

# import packages
import pandas as pd
import numpy as np

# set random seed for reproducibility (seed 42 reproduces the col1 values in the first example)
np.random.seed(42)

# create random dataset
df = pd.DataFrame({'col1': np.random.rand(5),
                   'col2': np.random.randint(1, 10, 5),
                   'col3': np.random.randn(5)})

# calculate the Z-score for each column
df_zscore = (df - df.mean()) / df.std(ddof=0)

# view DataFrame
print(df_zscore)

       col1      col2      col3
0 -0.680221 -1.293993  1.155573
1  1.406211  0.862662 -1.831159
2  0.614185  1.401826  0.448870
3  0.131353 -0.215666 -0.091975
4 -1.471527 -0.754829  0.318691

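In the above code, ddof=0 makes pandas calculate the population standard deviation. The pandas std function defaults to ddof=1 (sample standard deviation), whereas the SciPy zscore function defaults to ddof=0, so setting ddof=0 reproduces the zscore results.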
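The effect of ddof can be checked on a single column. The sketch below uses the col2 values from the example output above; the variable names are illustrative.

```python
import pandas as pd

# col2 values from the example output
s = pd.Series([3, 7, 8, 5, 4])

# population standard deviation (divides by n)
z_pop = (s - s.mean()) / s.std(ddof=0)

# pandas default, ddof=1: sample standard deviation (divides by n - 1)
z_samp = (s - s.mean()) / s.std()

print(z_pop.round(6).tolist())
print(z_samp.round(6).tolist())
```

The first list matches the col2 Z-scores shown above. The second list has slightly smaller absolute values, because the sample standard deviation is larger than the population standard deviation for the same data.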

The resulting DataFrame shows the calculated Z-score for all columns.