How to Get Group Statistics Using pandas GroupBy

2024-07-31 749 words 4 minutes

Contents

The pandas groupby function is useful for statistical analysis of the group-specific data in the pandas DataFrame.

In pandas DataFrame, the group-wise summary statistics can be obtained by using groupby.describe() and groupby.agg() functions.

The following examples explain how to get group-wise summary statistics for a pandas DataFrame using groupby.describe() and groupby.agg() functions.

Using `groupby.describe()` function

Create a sample pandas DataFrame,

# import package
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B', 'C', 'C'], 	
	'col2': [10, 14, 20, 25, 30, 32]})

# view DataFrame
df

  col1  col2
0    A    10
1    A    14
2    B    20
3    B    25
4    C    30
5    C    32

In this DataFrame, col1 contains the various groups and col2 contains their values.

You can use the groupby.describe() function to get the summary statistics including value counts, mean, standard deviation, min, max, and percentiles for groups in col1. The 50 percentile (50%) is the median.

df.groupby(["col1"])['col2'].describe()

# output
      count  mean       std   min    25%   50%    75%   max
col1
A       2.0  12.0  2.828427  10.0  11.00  12.0  13.00  14.0
B       2.0  22.5  3.535534  20.0  21.25  22.5  23.75  25.0
C       2.0  31.0  1.414214  30.0  30.50  31.0  31.50  32.0

You can also get specific statistics with the groupby.describe() function. For example, if you only want to get value counts, mean, and standard deviation, you can use the groupby.describe() function as below:

df.groupby(["col1"])['col2'].describe()[["count", "mean", "std"]]

# output
      count  mean       std
col1
A       2.0  12.0  2.828427
B       2.0  22.5  3.535534
C       2.0  31.0  1.414214

You can also get group-wise summary statistics based on the multiple number of columns (categories) in the pandas DataFrame.

For example, if you have the following pandas DataFrame,

# import package
import pandas as pd

df = pd.DataFrame({'col1': ['X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y'], 	
	'col2': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'], 	
	'col3': [10, 14, 20, 25, 30, 32, 30, 40]})

# view DataFrame
df

  col1 col2  col3
0    X    A    10
1    X    A    14
2    X    B    20
3    X    B    25
4    Y    A    30
5    Y    A    32
6    Y    B    30
7    Y    B    40

You can use the groupby.describe() function on this DataFrame with multiple categories to get the summary statistics including value counts, mean, standard deviation, min, max, and percentiles.

df.groupby(["col1", "col2"])['col3'].describe()

# output
           count  mean       std   min    25%   50%    75%   max
col1 col2
X    A       2.0  12.0  2.828427  10.0  11.00  12.0  13.00  14.0
     B       2.0  22.5  3.535534  20.0  21.25  22.5  23.75  25.0
Y    A       2.0  31.0  1.414214  30.0  30.50  31.0  31.50  32.0
     B       2.0  35.0  7.071068  30.0  32.50  35.0  37.50  40.0

Using `groupby.agg()` function

You can also use the groupby.agg() function for getting summary statistics for each group in the pandas DataFrame.

Create a sample pandas DataFrame,

# import package
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B', 'C', 'C'], 	
	'col2': [10, 14, 20, 25, 30, 32]})

# view DataFrame
df

  col1  col2
0    A    10
1    A    14
2    B    20
3    B    25
4    C    30
5    C    32

You can use the groupby.agg() function to get the summary statistics such as value counts, mean, standard deviation, min, max, etc.

For example, get the group-wise counts,

df.groupby(["col1"])['col2'].agg("count")

# output
col1
A    2
B    2
C    2

You can also apply various functions such as mean, standard deviation, sum, min and max to the groupby.agg() function.

df.groupby(["col1"])['col2'].agg(["count", "mean", "sum", "std"])

# output
      count  mean  sum       std
col1
A         2  12.0   24  2.828427
B         2  22.5   45  3.535534
C         2  31.0   62  1.414214

In addition, you can also use the groupby.agg() function to get the summary statistics based on the multiple columns (categories) in the DataFrame.

For example, if you have the following pandas DataFrame,

# import package
import pandas as pd

df = pd.DataFrame({'col1': ['X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y'], 	
	'col2': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'], 	
	'col3': [10, 14, 20, 25, 30, 32, 30, 40]})

# view DataFrame
df

  col1 col2  col3
0    X    A    10
1    X    A    14
2    X    B    20
3    X    B    25
4    Y    A    30
5    Y    A    32
6    Y    B    30
7    Y    B    40

You can use the groupby.describe() function on this DataFrame with multiple categories to get the summary statistics including value counts, mean, standard deviation, min, max, etc.

df.groupby(["col1", "col2"])['col3'].agg(["count", "mean", "sum", "std"])

# output
           count  mean  sum       std
col1 col2
X    A         2  12.0   24  2.828427
     B         2  22.5   45  3.535534
Y    A         2  31.0   62  1.414214
     B         2  35.0   70  7.071068

Contents

How to Get Group Statistics Using pandas GroupBy

Using groupby.describe() function

Using groupby.agg() function

Using `groupby.describe()` function

Using `groupby.agg()` function