[Explained] How to GroupBy Dataframe in Python, Pandas, PySpark

Name: Oluwaseun Adeojo

Published on 8/17/2023

Grouping data forms an essential part of data analysis, be it for calculating aggregates or applying complex transformations. The pandas groupby function in Python is a robust and versatile tool that enables you to perform such operations efficiently. With its extensive functionality, it streamlines the process of manipulating data grouped based on certain conditions, making data analysis a much smoother task.

The pandas groupby function is especially powerful when it comes to handling large dataframes, thanks to its optimised implementation. By leveraging pandas dataframe groupby, you can group by single or multiple columns, apply several aggregate functions, and even perform advanced tasks like filtering and sorting the grouped data. This guide aims to unravel the power of the pandas groupby function, offering insights, best practices, and practical examples.

Want to quickly create data visualisations from a pandas dataframe with no code?

PyGWalker is a Python library for Exploratory Data Analysis with Visualization. PyGWalker can simplify your Jupyter Notebook data analysis and data visualization workflow by turning your pandas dataframe (or polars dataframe) into a Tableau-style user interface for visual exploration.

Understanding Pandas GroupBy

In simple terms, a pandas groupby operation involves splitting the data into groups based on certain criteria, applying a function to each group, and then combining the results. This process is known as the "split-apply-combine" strategy, a term popularised by Hadley Wickham and adopted by the pandas documentation.

The groupby function in pandas works much like the SQL GROUP BY statement, which makes it easier for those transitioning from SQL to Python for data analysis.

Here's a basic example of how you can use pandas dataframe groupby:

import pandas as pd
 
## Creating a sample dataframe
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John', 'Anna'],
    'Subject': ['Maths', 'Maths', 'Physics', 'Physics', 'Chemistry', 'Chemistry'],
    'Score': [85, 90, 78, 88, 92, 95]
}
 
df = pd.DataFrame(data)
 
## Applying groupby
grouped = df.groupby('Name')
for name, group in grouped:
    print("\n", name)
    print(group)
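
To make the split-apply-combine idea concrete, the following sketch performs the three steps by hand and then with a single groupby call. It reuses the sample data from above; the variable names (pieces, means, manual, built_in) are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Anna', 'John', 'Anna'],
    'Subject': ['Maths', 'Maths', 'Physics', 'Physics', 'Chemistry', 'Chemistry'],
    'Score': [85, 90, 78, 88, 92, 95],
})

# Split: one sub-dataframe per distinct name
pieces = {name: df[df['Name'] == name] for name in df['Name'].unique()}

# Apply: reduce each piece to a single number
means = {name: piece['Score'].mean() for name, piece in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(means, name='Score').sort_index()

# The same three steps expressed as one groupby call
built_in = df.groupby('Name')['Score'].mean()

print(manual)
print(built_in)
```

Both printouts should show the same mean score per student; the groupby version simply hides the bookkeeping.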

Grouping by Multiple Columns

In addition to grouping by a single column, pandas groupby also supports grouping by multiple columns. This is especially useful when you want to categorise your data based on multiple attributes. Let's extend the previous example and group by two columns at once:

## Applying groupby on multiple columns
grouped_multiple = df.groupby(['Name', 'Subject'])
for (name, subject), group in grouped_multiple:
    print("\n", name, subject)
    print(group)

As you can see, pandas dataframe groupby grouped the data first by 'Name', and then by 'Subject' within each 'Name' group. This kind of grouping allows for complex data analysis operations.
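
When an aggregate is applied to a multi-column grouping, the result is indexed by every grouping column (a MultiIndex). The sketch below, using the same sample data, shows one way to look up a single (Name, Subject) pair and how as_index=False keeps the grouping columns as ordinary columns instead. The variable names by_pair and flat are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Anna', 'John', 'Anna'],
    'Subject': ['Maths', 'Maths', 'Physics', 'Physics', 'Chemistry', 'Chemistry'],
    'Score': [85, 90, 78, 88, 92, 95],
})

# Aggregating a two-column grouping gives a Series with a two-level index
by_pair = df.groupby(['Name', 'Subject'])['Score'].mean()
print(by_pair.loc[('Anna', 'Maths')])  # look up one (Name, Subject) pair

# as_index=False returns a flat dataframe with Name and Subject as columns
flat = df.groupby(['Name', 'Subject'], as_index=False)['Score'].mean()
print(flat)
```

The flat form is often easier to merge, plot, or export, while the indexed form is convenient for lookups.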

Aggregate Functions with Pandas GroupBy

One of the major benefits of pandas groupby is that it allows us to apply aggregate functions to the grouped data. Common aggregate functions include sum, mean, count, max, and min. Let's see an example using pandas groupby and sum:

## Using sum with groupby
grouped_sum = df.groupby('Name')['Score'].sum()
print(grouped_sum)

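In the example, we are summing up the scores of each student. Notice that we used the column indexer (['Score']) right after groupby. This restricts the aggregation to the 'Score' column. Without it, pandas would try to sum every non-grouping column, and depending on the pandas version a text column such as 'Subject' is either concatenated string by string or dropped from the result, neither of which is what we want here.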
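
Several aggregates can also be computed in one pass with agg. The sketch below assumes the same sample data; the output column names total and best are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Anna', 'John', 'Anna'],
    'Subject': ['Maths', 'Maths', 'Physics', 'Physics', 'Chemistry', 'Chemistry'],
    'Score': [85, 90, 78, 88, 92, 95],
})

# A list of function names produces one output column per aggregate
summary = df.groupby('Name')['Score'].agg(['sum', 'mean', 'min', 'max', 'count'])
print(summary)

# Named aggregation: output_name=(input_column, function)
named = df.groupby('Name').agg(total=('Score', 'sum'), best=('Score', 'max'))
print(named)
```

Named aggregation is handy when the default column names (sum, max, and so on) would be ambiguous in later steps.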

Sorting Data with Pandas GroupBy

It's common to sort data after performing a groupby operation. For instance, you might want to sort the groups by their aggregate values. Here's how you can chain sort_values onto a groupby result:

## Sorting data after groupby
grouped_sorted = df.groupby('Name')['Score'].sum().sort_values(ascending=False)
print(grouped_sorted)

In the example, we first grouped the dataframe by 'Name', then summed up the 'Score' for each group, and finally sorted the groups by the sum of 'Score' in descending order.
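
Sorting can also happen before grouping, which is one way to pick the top row of each group. The sketch below, on the same sample data, keeps each student's highest-scoring subject; it is one possible approach rather than the only one, and the variable name top is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Anna', 'John', 'Anna'],
    'Subject': ['Maths', 'Maths', 'Physics', 'Physics', 'Chemistry', 'Chemistry'],
    'Score': [85, 90, 78, 88, 92, 95],
})

# Sort all rows by score, then take the first row seen for each student
top = df.sort_values('Score', ascending=False).groupby('Name').head(1)
print(top)
```

Because head(1) keeps whole rows, the result still carries the 'Subject' column alongside the best score.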

Custom Aggregation with GroupBy Apply

pandas groupby allows for custom aggregation by using the apply function. This can be useful when built-in aggregate functions do not suffice. For example, suppose you want to calculate the range (maximum - minimum) of scores for each student. You can use groupby apply in pandas as follows:

## Custom aggregation with groupby apply
grouped_apply = df.groupby('Name')['Score'].apply(lambda x: x.max() - x.min())
print(grouped_apply)

In this example, for each group, we calculate the range of 'Score' using a lambda function and apply this function to each group with apply.
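
When the custom function reduces each group to a single value, as the range does, agg is an alternative to apply and lets you mix custom and built-in aggregates. The sketch below assumes the same sample data; score_range is a helper defined here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Anna', 'John', 'Anna'],
    'Subject': ['Maths', 'Maths', 'Physics', 'Physics', 'Chemistry', 'Chemistry'],
    'Score': [85, 90, 78, 88, 92, 95],
})

def score_range(scores: pd.Series):
    """Spread between the highest and lowest score in one group."""
    return scores.max() - scores.min()

# A named function passed to agg behaves like the lambda passed to apply
via_agg = df.groupby('Name')['Score'].agg(score_range)
print(via_agg)

# Built-in and custom aggregates side by side; the function name becomes the column name
both = df.groupby('Name')['Score'].agg(['mean', score_range])
print(both)
```

A named function also gives the output column a readable label, which a lambda does not.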

Difference Between GroupBy and Pivot in Pandas

Both pandas groupby and pivot table are powerful tools for data summarisation, but they are typically used in different contexts. groupby is used when you want to summarise your data based on some category and keep the result in long form, whereas a pivot table also reshapes the summary so that one category runs down the rows and another runs across the columns.

In a pandas groupby operation, you specify one or more columns to group by, and then specify an aggregate function to apply to each group. A pivot table, on the other hand, takes simple column-wise data as input and groups the entries into a two-dimensional table: you choose which column supplies the row labels, which supplies the column labels, and which aggregate fills the cells.
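
The relationship between the two can be sketched on the sample data from earlier: grouping by two columns and then unstacking one of them gives the same wide table that pivot_table builds directly. The variable names below are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Anna', 'John', 'Anna'],
    'Subject': ['Maths', 'Maths', 'Physics', 'Physics', 'Chemistry', 'Chemistry'],
    'Score': [85, 90, 78, 88, 92, 95],
})

# groupby: long-form result, one row per (Name, Subject) pair
long_form = df.groupby(['Name', 'Subject'])['Score'].mean()

# Moving 'Subject' from the index to the columns gives a wide table
wide_from_groupby = long_form.unstack('Subject')

# pivot_table: aggregates and reshapes in one call
wide_from_pivot = df.pivot_table(
    index='Name', columns='Subject', values='Score', aggfunc='mean'
)

print(wide_from_groupby)
print(wide_from_pivot)
```

Both tables have one row per student and one column per subject.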

GroupBy Non-Numeric Data in Pandas

It is indeed possible to group by non-numeric data in pandas, and to aggregate non-numeric columns. Numeric aggregates such as mean do not apply to text (and sum on strings concatenates rather than adds), but there are plenty of other operations you can perform. For example, you can count the number of occurrences of each category, or you can apply any function that makes sense for the data type of the non-numeric column.

## Groupby non-numeric data and count
grouped_count = df.groupby('Name')['Subject'].count()
print(grouped_count)

In this example, we're counting the number of subjects each student has by grouping by 'Name' and counting the non-null 'Subject' entries in each group.
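
A few other aggregates that make sense for text columns are sketched below on the same sample data. The variable names are illustrative, and joining with ', ' is just one way to summarise strings.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Anna', 'John', 'Anna'],
    'Subject': ['Maths', 'Maths', 'Physics', 'Physics', 'Chemistry', 'Chemistry'],
    'Score': [85, 90, 78, 88, 92, 95],
})

# Number of distinct subjects per student
n_subjects = df.groupby('Name')['Subject'].nunique()

# All subjects per student joined into one string, in row order
subject_list = df.groupby('Name')['Subject'].agg(', '.join)

# First subject listed for each student
first_subject = df.groupby('Name')['Subject'].first()

print(n_subjects)
print(subject_list)
print(first_subject)
```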

GroupBy with PySpark

The groupby concept also extends to big data frameworks like PySpark. Although the syntax differs slightly, the idea remains the same - splitting the data into groups and applying some function to each group.

## GroupBy in PySpark
from pyspark.sql import SparkSession
 
spark = SparkSession.builder.getOrCreate()
 
## Load data into PySpark DataFrame
df_pyspark = spark.createDataFrame(df)
 
## GroupBy in PySpark
df_pyspark.groupby('Name').agg({'Score': 'sum'}).show()

In PySpark, agg applies one or more aggregate functions after grouping; for a single built-in aggregate, shortcut methods such as df_pyspark.groupby('Name').sum('Score') do the same job. In the example above, we're grouping by 'Name' and summing the 'Score' for each group.

As you delve deeper into the realm of data analysis with Python, you'll find pandas dataframe groupby to be a reliable companion. With its flexibility and power, you can handle and explore data in ways that were previously only available to those with a background in programming or statistics. So dive in, experiment with the different functionalities, and watch as your data yields valuable insights!

FAQs

What is the difference between groupby and pivot in Pandas?

Pandas groupby is used for summarising data based on a category, whereas pivot table is used for reshaping data into a two-dimensional table for multidimensional analysis.

Can I groupby non-numeric data in Pandas?

Yes, you can perform groupby on non-numeric data in Pandas. Numeric aggregates such as mean do not apply to non-numeric data, but there are plenty of operations that you can perform, like counting the number of occurrences of each category.

How do I use groupby with PySpark?

The groupby concept in PySpark is similar to that in Pandas. After grouping, you apply an aggregate to each group with the agg function or with a shortcut method such as sum or count.
