How to Get Group Statistics Using pandas GroupBy
The pandas groupby
function is useful for statistical analysis of the group-specific data in the pandas DataFrame.
In pandas DataFrame, the group-wise summary statistics can be obtained by using groupby.describe()
and groupby.agg()
functions.
The following examples explain how to get group-wise summary statistics for a pandas DataFrame using groupby.describe()
and groupby.agg()
functions.
Using groupby.describe()
function
Create a sample pandas DataFrame,
# import package
import pandas as pd
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B', 'C', 'C'],
'col2': [10, 14, 20, 25, 30, 32]})
# view DataFrame
df
col1 col2
0 A 10
1 A 14
2 B 20
3 B 25
4 C 30
5 C 32
In this DataFrame, col1
contains the various groups and col2
contains their values.
You can use the groupby.describe()
function to get the summary statistics including value counts, mean, standard deviation, min, max, and
percentiles for groups in col1
. The 50 percentile (50%) is the median.
df.groupby(["col1"])['col2'].describe()
# output
count mean std min 25% 50% 75% max
col1
A 2.0 12.0 2.828427 10.0 11.00 12.0 13.00 14.0
B 2.0 22.5 3.535534 20.0 21.25 22.5 23.75 25.0
C 2.0 31.0 1.414214 30.0 30.50 31.0 31.50 32.0
You can also get specific statistics with the groupby.describe()
function. For example, if you only want to get value counts, mean, and standard deviation, you
can use the groupby.describe()
function as below:
df.groupby(["col1"])['col2'].describe()[["count", "mean", "std"]]
# output
count mean std
col1
A 2.0 12.0 2.828427
B 2.0 22.5 3.535534
C 2.0 31.0 1.414214
You can also get group-wise summary statistics based on the multiple number of columns (categories) in the pandas DataFrame.
For example, if you have the following pandas DataFrame,
# import package
import pandas as pd
df = pd.DataFrame({'col1': ['X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y'],
'col2': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
'col3': [10, 14, 20, 25, 30, 32, 30, 40]})
# view DataFrame
df
col1 col2 col3
0 X A 10
1 X A 14
2 X B 20
3 X B 25
4 Y A 30
5 Y A 32
6 Y B 30
7 Y B 40
You can use the groupby.describe()
function on this DataFrame with multiple categories to get the summary statistics including value counts, mean, standard deviation, min, max, and
percentiles.
df.groupby(["col1", "col2"])['col3'].describe()
# output
count mean std min 25% 50% 75% max
col1 col2
X A 2.0 12.0 2.828427 10.0 11.00 12.0 13.00 14.0
B 2.0 22.5 3.535534 20.0 21.25 22.5 23.75 25.0
Y A 2.0 31.0 1.414214 30.0 30.50 31.0 31.50 32.0
B 2.0 35.0 7.071068 30.0 32.50 35.0 37.50 40.0
Using groupby.agg()
function
You can also use the groupby.agg()
function for getting summary statistics for each group in the pandas DataFrame.
Create a sample pandas DataFrame,
# import package
import pandas as pd
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B', 'C', 'C'],
'col2': [10, 14, 20, 25, 30, 32]})
# view DataFrame
df
col1 col2
0 A 10
1 A 14
2 B 20
3 B 25
4 C 30
5 C 32
You can use the groupby.agg()
function to get the summary statistics such as value counts, mean, standard deviation, min, max, etc.
For example, get the group-wise counts,
df.groupby(["col1"])['col2'].agg("count")
# output
col1
A 2
B 2
C 2
You can also apply various functions such as mean, standard deviation, sum, min and max to the groupby.agg()
function.
df.groupby(["col1"])['col2'].agg(["count", "mean", "sum", "std"])
# output
count mean sum std
col1
A 2 12.0 24 2.828427
B 2 22.5 45 3.535534
C 2 31.0 62 1.414214
In addition, you can also use the groupby.agg()
function to get the summary statistics based on the multiple columns (categories) in the DataFrame.
For example, if you have the following pandas DataFrame,
# import package
import pandas as pd
df = pd.DataFrame({'col1': ['X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y'],
'col2': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
'col3': [10, 14, 20, 25, 30, 32, 30, 40]})
# view DataFrame
df
col1 col2 col3
0 X A 10
1 X A 14
2 X B 20
3 X B 25
4 Y A 30
5 Y A 32
6 Y B 30
7 Y B 40
You can use the groupby.describe()
function on this DataFrame with multiple categories to get the summary statistics including value counts, mean, standard deviation, min, max, etc.
df.groupby(["col1", "col2"])['col3'].agg(["count", "mean", "sum", "std"])
# output
count mean sum std
col1 col2
X A 2 12.0 24 2.828427
B 2 22.5 45 3.535534
Y A 2 31.0 62 1.414214
B 2 35.0 70 7.071068