Calculate Confidence Interval for pandas DataFrame

2024-08-14

A confidence interval gives an estimated range of values within which the true parameter value, such as the population mean, is likely to fall at a given level of confidence (e.g., a 95% confidence interval).

In Python, you can use the groupby function from pandas to calculate the mean and confidence interval for various groups in the DataFrame.

Sample dataset

In this article, we will use the flights dataset from the seaborn package to calculate the confidence interval.

# import package
import seaborn as sns

# sample data
flights = sns.load_dataset("flights")
flights.head()

# output
   year month  passengers
0  1949   Jan         112
1  1949   Feb         118
2  1949   Mar         132
3  1949   Apr         129
4  1949   May         121

This dataset records the number of passengers that flew each month from 1949 to 1960 and contains three variables: year (the grouping variable), month, and passengers.

We will use the year variable to calculate the mean and confidence interval.

Confidence interval formula

Generally, the confidence interval based on Z-test is calculated as:

$\text{Confidence Interval} = \overline{x} \pm z\left(\frac{s}{\sqrt{n}}\right)$

Where x̄ is the sample mean, z is the critical value for the chosen confidence level, s is the sample standard deviation, and n is the sample size.
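To make the formula concrete, here is a minimal sketch that applies it by hand to a single group. It assumes the summary statistics for 1949 from the flights dataset (mean 126.666667, standard deviation 13.720147, n = 12), which are the values reported in the 95% output later in this article. The variable names are illustrative only.

```python
import math

# Summary statistics for 1949, copied from the output table in this article
mean = 126.666667   # sample mean (x-bar)
std = 13.720147     # sample standard deviation (s)
n = 12              # sample size
z = 1.96            # critical value for a 95% confidence level

# Margin of error: z * (s / sqrt(n))
margin = z * (std / math.sqrt(n))

# The interval is the mean minus and plus the margin of error
lower_ci = mean - margin
upper_ci = mean + margin
print(lower_ci, upper_ci)
```

The result should agree with the 1949 row of the 95% output, roughly 118.90 to 134.43.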

If you want to calculate the confidence interval based on t-test, please visit this article.

Calculate 95% confidence interval

Calculate the 95% confidence interval for the flights dataset

The critical value (z) is 1.96 for a 95% confidence interval.

Tip

You can use the norm distribution object from scipy.stats to calculate the critical value for a confidence interval. For example, the two-tailed critical value for a 95% confidence interval is calculated as scipy.stats.norm.ppf(1-.05/2), which returns approximately 1.96.
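If scipy is not installed, the same critical value can be obtained from Python's standard library. The sketch below uses statistics.NormalDist as a stand-in for scipy.stats.norm; the helper name z_critical is hypothetical and not part of either library.

```python
from statistics import NormalDist


def z_critical(confidence):
    """Two-tailed z critical value for a confidence level such as 0.95."""
    alpha = 1 - confidence
    # inv_cdf is the inverse of the cumulative distribution function,
    # which plays the same role as scipy.stats.norm.ppf
    return NormalDist().inv_cdf(1 - alpha / 2)


print(round(z_critical(0.95), 3))  # 1.96
print(round(z_critical(0.90), 3))  # 1.645
```

These are the two critical values used in the 95% and 90% calculations in this article.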

# import package
import pandas as pd
import numpy as np

# calculate upper and lower 95% confidence interval
df = flights.groupby(["year"])['passengers'].describe()[["count", "mean", "std"]].reset_index()
df["lower_ci"] = df["mean"] - 1.96*(df["std"]/np.sqrt(df["count"]))
df["upper_ci"] = df["mean"] + 1.96*(df["std"]/np.sqrt(df["count"]))

df.head()

# output
   year  count        mean        std    lower_ci    upper_ci
0  1949   12.0  126.666667  13.720147  118.903763  134.429570
1  1950   12.0  139.666667  19.070841  128.876323  150.457011
2  1951   12.0  170.166667  18.438267  159.734235  180.599098
3  1952   12.0  197.000000  22.966379  184.005548  209.994452
4  1953   12.0  225.000000  28.466887  208.893343  241.106657

In the above output, you can see that we have calculated the upper and lower 95% confidence interval for the flights dataset.

If you want to plot and shade the regions of the confidence interval, please visit this article.

Calculate 90% confidence interval

Calculate mean and 90% confidence interval for the flights dataset

The critical value (z) is 1.645 for a 90% confidence interval.

# import package
import pandas as pd
import numpy as np

# calculate upper and lower 90% confidence interval
df = flights.groupby(["year"])['passengers'].describe()[["count", "mean", "std"]].reset_index()
df["lower_ci"] = df["mean"] - 1.645*(df["std"]/np.sqrt(df["count"]))
df["upper_ci"] = df["mean"] + 1.645*(df["std"]/np.sqrt(df["count"]))

df.head()

# output
   year  count        mean        std    lower_ci    upper_ci
0  1949   12.0  126.666667  13.720147  120.151372  133.181961
1  1950   12.0  139.666667  19.070841  130.610485  148.722848
2  1951   12.0  170.166667  18.438267  161.410876  178.922458
3  1952   12.0  197.000000  22.966379  186.093942  207.906058
4  1953   12.0  225.000000  28.466887  211.481913  238.518087

In the above output, you can see that we have calculated the upper and lower 90% confidence interval for the flights dataset.
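Since the 95% and 90% calculations differ only in the critical value, they can be folded into one function. The following is a minimal sketch, not the only way to do it: the helper name mean_ci and its parameters are hypothetical, the critical value comes from the standard library's statistics.NormalDist rather than a hard-coded constant, and the 1949 and 1950 passenger counts are typed in by hand so that the example runs without downloading the dataset.

```python
from statistics import NormalDist

import numpy as np
import pandas as pd


def mean_ci(df, group_col, value_col, confidence=0.95):
    """Per-group mean with a z-based confidence interval (hypothetical helper)."""
    # Two-tailed critical value for the requested confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Per-group sample size, mean, and sample standard deviation
    out = (
        df.groupby(group_col)[value_col]
        .agg(["count", "mean", "std"])
        .reset_index()
    )
    # Margin of error: z * (s / sqrt(n))
    margin = z * out["std"] / np.sqrt(out["count"])
    out["lower_ci"] = out["mean"] - margin
    out["upper_ci"] = out["mean"] + margin
    return out


# Monthly passenger counts for 1949 and 1950, entered manually (assumption:
# these match the corresponding rows of the seaborn flights dataset)
sample = pd.DataFrame({
    "year": [1949] * 12 + [1950] * 12,
    "passengers": [
        112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
        115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140,
    ],
})

result = mean_ci(sample, "year", "passengers", confidence=0.90)
print(result)
```

Because the function uses the unrounded critical value (about 1.64485 instead of 1.645), the bounds can differ from the outputs above in the fourth decimal place.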