Calculate Confidence Interval for pandas DataFrame
A confidence interval gives an estimated range of values within which the true parameter value, such as the population mean, is likely to fall at a given level of confidence (e.g., 95%).
In Python, you can use the groupby function from pandas to calculate the mean and confidence interval for each group in a DataFrame.
Sample dataset
In this article, we will use the flights dataset from the seaborn package to calculate the confidence interval.
# import package
import seaborn as sns
# sample data
flights = sns.load_dataset("flights")
flights.head()
# output
year month passengers
0 1949 Jan 112
1 1949 Feb 118
2 1949 Mar 132
3 1949 Apr 129
4 1949 May 121
This dataset records the number of passengers that flew each month from 1949 to 1960. It contains three variables: year (the grouping variable), month, and passengers.
We will use the year variable to group the data, and then calculate the mean and confidence interval of passengers for each year.
Confidence interval formula
Generally, the confidence interval based on the Z-test is calculated as:
CI = x̄ ± z × (s / √n)
where x̄ is the sample mean, z is the critical value for the chosen confidence level, s is the sample standard deviation, and n is the sample size.
If you want to calculate the confidence interval based on t-test, please visit this article.
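As a quick check of the formula, here is a minimal sketch that applies it to a single year using only the Python standard library. The helper name z_confidence_interval is hypothetical, and the twelve 1949 passenger counts are typed in by hand from the flights dataset (only the first five appear in the output above), so treat them as an assumption. The critical value comes from statistics.NormalDist rather than a hard-coded constant.

```python
import math
import statistics

# Monthly passenger counts for 1949, typed in by hand (assumed to match the flights dataset)
passengers_1949 = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118]

def z_confidence_interval(values, confidence=0.95):
    """Return (lower, upper) bounds of a z-based confidence interval for the mean."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    # two-tailed critical value of the standard normal, about 1.96 for 95%
    z = statistics.NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * s / math.sqrt(n)
    return mean - margin, mean + margin

lower, upper = z_confidence_interval(passengers_1949)
print(round(lower, 2), round(upper, 2))
```

The bounds should land very close to the 1949 row of the groupby result in the next section. A small difference in the later decimals is expected because that section rounds z to 1.96.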
Calculate 95% confidence interval
Calculate the mean and 95% confidence interval for the flights dataset.
The critical value (z) is 1.96 for a 95% confidence interval.
You can use the norm.ppf function from scipy.stats to calculate the critical value for a confidence interval. For example, the two-tailed critical value for a 95% confidence interval is calculated as scipy.stats.norm.ppf(1-.05/2).
# import package
import pandas as pd
import numpy as np
# calculate upper and lower 95% confidence interval
df = flights.groupby(["year"])['passengers'].describe()[["count", "mean", "std"]].reset_index()
df["lower_ci"] = df["mean"] - 1.96*(df["std"]/np.sqrt(df["count"]))
df["upper_ci"] = df["mean"] + 1.96*(df["std"]/np.sqrt(df["count"]))
df.head()
# output
year count mean std lower_ci upper_ci
0 1949 12.0 126.666667 13.720147 118.903763 134.429570
1 1950 12.0 139.666667 19.070841 128.876323 150.457011
2 1951 12.0 170.166667 18.438267 159.734235 180.599098
3 1952 12.0 197.000000 22.966379 184.005548 209.994452
4 1953 12.0 225.000000 28.466887 208.893343 241.106657
In the above output, the lower_ci and upper_ci columns hold the lower and upper bounds of the 95% confidence interval for each year in the flights dataset.
If you want to plot and shade the regions of the confidence interval, please visit this article.
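Instead of hard-coding 1.96, the critical value can be derived from the confidence level with scipy.stats.norm.ppf. The sketch below assumes scipy is installed; the helper name z_critical is hypothetical and only wraps the two-tailed calculation.

```python
from scipy.stats import norm

def z_critical(confidence):
    """Two-tailed critical value of the standard normal for a confidence level in (0, 1)."""
    alpha = 1 - confidence
    return norm.ppf(1 - alpha / 2)

# print the critical value for a few common confidence levels
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_critical(level):.3f}")
# 90%: z = 1.645
# 95%: z = 1.960
# 99%: z = 2.576
```

The value returned for 95% is about 1.959964, so intervals computed with it differ slightly in the later decimals from those computed with the rounded 1.96.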
Calculate 90% confidence interval
Calculate the mean and 90% confidence interval for the flights dataset.
The critical value (z) is 1.645 for a 90% confidence interval.
# import package
import pandas as pd
import numpy as np
# calculate upper and lower 90% confidence interval
df = flights.groupby(["year"])['passengers'].describe()[["count", "mean", "std"]].reset_index()
df["lower_ci"] = df["mean"] - 1.645*(df["std"]/np.sqrt(df["count"]))
df["upper_ci"] = df["mean"] + 1.645*(df["std"]/np.sqrt(df["count"]))
df.head()
# output
year count mean std lower_ci upper_ci
0 1949 12.0 126.666667 13.720147 120.151372 133.181961
1 1950 12.0 139.666667 19.070841 130.610485 148.722848
2 1951 12.0 170.166667 18.438267 161.410876 178.922458
3 1952 12.0 197.000000 22.966379 186.093942 207.906058
4 1953 12.0 225.000000 28.466887 211.481913 238.518087
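In the above output, the lower_ci and upper_ci columns hold the lower and upper bounds of the 90% confidence interval for each year in the flights dataset. As expected, each 90% interval is narrower than the corresponding 95% interval.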
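Since the 95% and 90% calculations differ only in the critical value, the two steps can be folded into one reusable function. This is a sketch under stated assumptions, not the article's own code: the function name group_mean_ci and its parameters are hypothetical, it uses agg in place of describe to get the same three statistics, and it runs on a small hand-typed frame so that it works without downloading the flights data.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

def group_mean_ci(data, group_col, value_col, confidence=0.95):
    """Mean and z-based confidence interval of value_col for each group in group_col."""
    # two-tailed critical value, e.g. about 1.645 for 90% and 1.96 for 95%
    z = norm.ppf(1 - (1 - confidence) / 2)
    out = (
        data.groupby(group_col)[value_col]
        .agg(["count", "mean", "std"])
        .reset_index()
    )
    margin = z * out["std"] / np.sqrt(out["count"])
    out["lower_ci"] = out["mean"] - margin
    out["upper_ci"] = out["mean"] + margin
    return out

# small illustrative frame (not the full flights dataset)
toy = pd.DataFrame({
    "year": [1949] * 4 + [1950] * 4,
    "passengers": [112, 118, 132, 129, 115, 126, 141, 135],
})
result = group_mean_ci(toy, "year", "passengers", confidence=0.90)
print(result)
```

With the real data, the call would be group_mean_ci(flights, "year", "passengers", confidence=0.90), which should reproduce the table above up to the rounding of z.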