Calculate Z-score for Columns in pandas DataFrame

2024-06-03

The Z-score (also known as the standard score) is a statistical measure of how many standard deviations a data point lies from the mean of the data distribution.

In a pandas DataFrame, you can calculate the Z-score for one or all columns using the zscore function from the SciPy Python package or manually with the Z-score formula.

The following example demonstrates how to calculate the Z-score for all numeric columns in a pandas DataFrame.

Using `zscore` function

First, create a pandas DataFrame with random values:

# import package
import pandas as pd
import numpy as np

# set random seed for reproducibility
np.random.seed(42)

# create random dataset
df = pd.DataFrame({'col1': np.random.rand(5),
                   'col2': np.random.randint(1, 10, 5),
                   'col3': np.random.randn(5)})

# view DataFrame
df

       col1  col2      col3
0  0.374540     3  0.392580
1  0.950714     7 -0.929185
2  0.731994     8  0.079832
3  0.598658     5 -0.159517
4  0.156019     4  0.022222

Now, calculate the Z-score for each column of the DataFrame. We will use the zscore function from the SciPy Python package.

# import package
from scipy.stats import zscore

# calculate z score for each column
df_zscore = df.apply(zscore)
 				   
# view DataFrame
df_zscore

      col1      col2      col3
0 -0.680221 -1.293993  1.155573
1  1.406211  0.862662 -1.831159
2  0.614185  1.401826  0.448870
3  0.131353 -0.215666 -0.091975
4 -1.471527 -0.754829  0.318691

The pandas apply function passes each column of the DataFrame to the SciPy zscore function.

The resulting DataFrame shows the calculated Z-score for all columns.
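To score a single column rather than the whole DataFrame, the same function can be applied to one column. The following is a minimal sketch that rebuilds the DataFrame from the example above; the name `col1_z` is illustrative only and not part of the original example. It also checks that the standardized values have a mean of 0 and a population standard deviation of 1, which is a quick way to confirm the calculation.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# rebuild the example DataFrame
np.random.seed(42)
df = pd.DataFrame({'col1': np.random.rand(5),
                   'col2': np.random.randint(1, 10, 5),
                   'col3': np.random.randn(5)})

# Z-score for a single column (col1_z is an illustrative name)
col1_z = pd.Series(zscore(df['col1'].to_numpy()),
                   index=df.index, name='col1_z')

# sanity check: standardized values have mean 0 and population std 1
col1_mean = col1_z.mean()
col1_std = col1_z.std(ddof=0)
```

Up to floating-point error, `col1_mean` should be 0 and `col1_std` should be 1.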

Tip

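If the DataFrame contains missing values (NaN) or categorical columns, drop them before applying the zscore function. With its default settings, zscore returns NaN for every value in a column that contains a missing value, and it cannot process non-numeric data.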
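One possible way to follow this tip is sketched below. The DataFrame `df_mixed` and its column names are hypothetical and used only for illustration. The sketch keeps the numeric columns with `select_dtypes` and then drops any row that contains a missing value; whether dropping rows is appropriate depends on your data.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# hypothetical DataFrame with a missing value and a text column
df_mixed = pd.DataFrame({'col1': [1.0, 2.0, np.nan, 4.0],
                         'col2': [10, 20, 30, 40],
                         'group': ['a', 'b', 'a', 'b']})

# keep only numeric columns, then drop rows with missing values
df_numeric = df_mixed.select_dtypes(include='number').dropna()

# calculate the Z-score for the remaining columns
df_zscore = df_numeric.apply(zscore)
```

Here the `group` column and the row containing NaN are excluded, so `df_zscore` has two columns and three rows.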

Using Z-score formula

You can also manually calculate the Z-score for each column in the pandas DataFrame using the Z-score formula.

The Z-score is calculated as:

$Z = \frac{x - \mu}{\sigma}$

Here, x is a data point, µ is the mean of the data, and σ is the standard deviation of the data.
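As a small worked example of the formula, the sketch below uses a made-up array of eight values (not taken from the DataFrame in this article) whose mean is 5 and whose population standard deviation is 2, so the arithmetic is easy to follow by hand.

```python
import numpy as np

# illustrative data: mean is 5, population standard deviation is 2
x = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mu = x.mean()      # 5.0
sigma = x.std()    # 2.0 (NumPy uses ddof=0 by default)

# apply the Z-score formula to every data point
z = (x - mu) / sigma
# z is [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

For instance, the value 9 lies (9 - 5) / 2 = 2 standard deviations above the mean.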

You can apply this formula to every column of the DataFrame at once:

# import packages
import pandas as pd
import numpy as np

# set random seed for reproducibility
np.random.seed(42)

# create random dataset
df = pd.DataFrame({'col1': np.random.rand(5),
                   'col2': np.random.randint(1, 10, 5),
                   'col3': np.random.randn(5)})

# calculate the Z-score for each column
df_zscore = (df - df.mean()) / df.std(ddof=0)

# view DataFrame
df_zscore

       col1      col2      col3
0 -0.680221 -1.293993  1.155573
1  1.406211  0.862662 -1.831159
2  0.614185  1.401826  0.448870
3  0.131353 -0.215666 -0.091975
4 -1.471527 -0.754829  0.318691

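In the above code, ddof=0 makes pandas calculate the population standard deviation. The pandas std method defaults to ddof=1 (the sample standard deviation), so ddof=0 is needed to match the zscore function, which uses ddof=0 by default.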
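The effect of ddof can be checked directly. The sketch below rebuilds the same DataFrame and compares the manual calculation with ddof=0 and with the pandas default (ddof=1); the variable names `z_pop`, `z_sample`, and `ratio` are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# rebuild the example DataFrame
np.random.seed(42)
df = pd.DataFrame({'col1': np.random.rand(5),
                   'col2': np.random.randint(1, 10, 5),
                   'col3': np.random.randn(5)})

# population standard deviation (ddof=0)
z_pop = (df - df.mean()) / df.std(ddof=0)

# sample standard deviation (pandas default, ddof=1)
z_sample = (df - df.mean()) / df.std()

# the ddof=0 version agrees with the SciPy zscore function
matches_scipy = np.allclose(z_pop.to_numpy(), df.apply(zscore).to_numpy())

# the ddof=1 version is scaled by sqrt((n - 1) / n)
n = len(df)
ratio = np.sqrt((n - 1) / n)
```

With only five rows, `ratio` is about 0.894, so the choice of ddof changes the Z-scores noticeably; the difference shrinks as the number of rows grows.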

The resulting DataFrame shows the calculated Z-score for all columns and matches the output of the zscore function.
