[Explained] How to GroupBy a DataFrame in Python, Pandas, and PySpark
Grouping data forms an essential part of data analysis, be it for calculating aggregates or applying complex transformations. The pandas groupby function in Python is a robust and versatile tool that enables you to perform such operations efficiently. With its extensive functionality, it streamlines the process of manipulating grouped data, making data analysis a much smoother task.
The pandas groupby function is especially powerful when it comes to handling large dataframes, thanks to its optimised implementation. By leveraging pandas dataframe groupby, you can group by single or multiple columns, apply several aggregate functions, and even perform advanced tasks like filtering and sorting the grouped data. This guide aims to unravel the power of the pandas groupby function, offering insights, best practices, and practical examples.
Want to quickly create Data Visualization from a Python Pandas Dataframe with No Code?
PyGWalker is a Python library for Exploratory Data Analysis with Visualization. PyGWalker can simplify your Jupyter Notebook data analysis and data visualization workflow by turning your pandas dataframe (and polars dataframe) into a Tableau-style user interface for visual exploration.
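As a quick sketch of how that looks in practice (assuming PyGWalker is installed, e.g. via pip install pygwalker):
## Launch the PyGWalker UI on a pandas dataframe inside a notebook
import pandas as pd
import pygwalker as pyg
df = pd.DataFrame({'Name': ['John', 'Anna'], 'Score': [85, 90]})
walker = pyg.walk(df)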
Understanding Pandas GroupBy
In simple terms, a pandas groupby operation involves splitting the data into groups based on certain criteria, applying a function to each group, and then combining the results. This process is known as the "split-apply-combine" strategy, a term coined by Hadley Wickham and adopted throughout the pandas documentation.
The groupby function in pandas uses a concept similar to the SQL GROUP BY statement, making it easier for those transitioning from SQL to Python for data analysis.
Here's a basic example of how you can use pandas dataframe groupby:
import pandas as pd
## Creating a sample dataframe
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John', 'Anna'],
    'Subject': ['Maths', 'Maths', 'Physics', 'Physics', 'Chemistry', 'Chemistry'],
    'Score': [85, 90, 78, 88, 92, 95]
}
df = pd.DataFrame(data)
## Applying groupby
grouped = df.groupby('Name')
for name, group in grouped:
print("\n", name)
print(group)
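Once the data is grouped, you can also pull out a single group by its key with get_group:
## Retrieve the rows belonging to one group
john_rows = grouped.get_group('John')
print(john_rows)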
Grouping by Multiple Columns
In addition to grouping by a single column, pandas groupby also supports grouping by multiple columns. This is especially useful when you want to categorise your data based on multiple attributes. Let's extend the previous example and perform a pandas groupby multiple columns operation:
## Applying groupby on multiple columns
grouped_multiple = df.groupby(['Name', 'Subject'])
for (name, subject), group in grouped_multiple:
print("\n", name, subject)
print(group)
As you can see, pandas dataframe groupby grouped the data first by 'Name', and then by 'Subject' within each 'Name' group. This kind of grouping allows for complex data analysis operations.
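You can aggregate directly on a multi-column grouping as well; for instance, the mean score for each (Name, Subject) pair:
## Mean score per (Name, Subject) pair
mean_scores = df.groupby(['Name', 'Subject'])['Score'].mean()
print(mean_scores)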
Aggregate Functions with Pandas GroupBy
One of the major benefits of pandas groupby
is that it allows us to apply aggregate functions to the grouped data. Common aggregate functions include sum
, mean
, count
, max
, and min
. Let's see an example using pandas groupby and sum
:
## Using sum with groupby
grouped_sum = df.groupby('Name')['Score'].sum()
print(grouped_sum)
In the example, we are summing up the scores of each student. Notice that we used the column indexer (['Score']) right after groupby. Selecting the 'Score' column restricts the aggregation to the numeric data we care about; without it, pandas would try to apply sum to every column in each group.
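You can also compute several aggregates at once by passing a list of function names to agg; a quick sketch:
## Applying multiple aggregate functions in one call
grouped_stats = df.groupby('Name')['Score'].agg(['sum', 'mean', 'max', 'min'])
print(grouped_stats)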
Sorting Data with Pandas GroupBy
It's common to sort data after performing a groupby operation. For instance, you might want to sort the groups by their aggregate values. Here's how you can use groupby sort values in pandas:
## Sorting data after groupby
grouped_sorted = df.groupby('Name')['Score'].sum().sort_values(ascending=False)
print(grouped_sorted)
In the example, we first grouped the dataframe by 'Name', then summed up the 'Score' for each group, and finally sorted the groups by the sum of 'Score' in descending order.
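A related pattern combines sorting with grouping to keep only the top rows per group. The sketch below, using the same sample dataframe, keeps each student's single highest score:
## Highest score per student: sort first, then take the first row of each group
top_scores = df.sort_values('Score', ascending=False).groupby('Name').head(1)
print(top_scores)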
Custom Aggregation with GroupBy Apply
pandas groupby allows for custom aggregation by using the apply function. This can be useful when built-in aggregate functions do not suffice. For example, suppose you want to calculate the range (maximum - minimum) of scores for each student. You can use groupby apply in pandas as follows:
## Custom aggregation with groupby apply
grouped_apply = df.groupby('Name')['Score'].apply(lambda x: x.max() - x.min())
print(grouped_apply)
In this example, we calculate the range of 'Score' within each group by passing a lambda function to apply, which runs it once per group.
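If you want the custom result alongside built-in aggregates in one labelled table, named aggregation with agg is an alternative; the column names score_range and score_mean below are just illustrative:
## Named aggregation: custom range next to a built-in mean
stats = df.groupby('Name')['Score'].agg(
    score_range=lambda x: x.max() - x.min(),
    score_mean='mean'
)
print(stats)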
Difference Between GroupBy and Pivot in Pandas
Both pandas groupby and pivot table are powerful tools for data summarisation, but they serve different purposes and are used in different contexts. To illustrate, pandas groupby is used when you want to summarise your data based on some category, whereas a pivot table is used to reshape your data.
In a pandas groupby operation, you specify one or more columns to group by, and then an aggregate function to apply to each group. A pivot table, on the other hand, takes simple column-wise data as input and arranges the entries into a two-dimensional table that makes multidimensional summaries easier to read.
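To make the contrast concrete, here is the same sample data reshaped with pivot_table, putting 'Name' on the rows and 'Subject' on the columns:
## Reshaping the same data with a pivot table
pivoted = df.pivot_table(index='Name', columns='Subject', values='Score', aggfunc='mean')
print(pivoted)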
GroupBy Non-Numeric Data in Pandas
It is indeed possible to groupby non-numeric data in pandas. While numeric aggregations like sum and mean cannot be applied to non-numeric data, there are plenty of operations you can perform: for example, you can count the number of occurrences of each category, or apply any function that makes sense for the data type of the non-numeric column.
## Groupby non-numeric data and count
grouped_count = df.groupby('Name')['Subject'].count()
print(grouped_count)
In this example, we're counting the number of subjects each student has by grouping by 'Name' and counting the 'Subject'.
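Beyond counting, any function that suits the column's type works; for example, collecting each student's subjects into a list:
## Collect the subjects of each student into a list
subjects_per_student = df.groupby('Name')['Subject'].apply(list)
print(subjects_per_student)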
GroupBy with PySpark
The groupby concept also extends to big data frameworks like PySpark. Although the syntax differs slightly, the idea remains the same: splitting the data into groups and applying some function to each group.
## GroupBy in PySpark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
## Convert the pandas dataframe into a PySpark DataFrame
df_pyspark = spark.createDataFrame(df)
## Group by 'Name' and sum 'Score'
df_pyspark.groupby('Name').agg({'Score': 'sum'}).show()
In PySpark, you need to use the agg function to apply an aggregate function after grouping. In the example above, we're grouping by 'Name' and summing the 'Score' for each group.
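If you need control over the output column names, you can pass expressions from pyspark.sql.functions to agg instead of a dict; a sketch, with the alias names chosen for illustration:
## Aggregating with pyspark.sql.functions for named output columns
from pyspark.sql import functions as F
df_pyspark.groupby('Name').agg(
    F.sum('Score').alias('total_score'),
    F.avg('Score').alias('avg_score')
).show()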
As you delve deeper into the realm of data analysis with Python, you'll find pandas dataframe groupby to be a reliable companion. With its flexibility and power, you can handle and explore data in ways that were previously only available to those with a background in programming or statistics. So dive in, experiment with the different functionalities, and watch as your data yields valuable insights!
FAQs
- What is the difference between groupby and pivot in Pandas?
Pandas groupby is used for summarising data based on a category, whereas a pivot table is used for reshaping data into a two-dimensional table for multidimensional analysis.
- Can I groupby non-numeric data in Pandas?
Yes, you can perform groupby on non-numeric data in Pandas. While you can't apply numeric aggregations like sum or mean to non-numeric data, there are plenty of operations you can perform, like counting the number of occurrences of each category.
- How do I use groupby with PySpark?
The groupby concept is similar in PySpark as in Pandas. After grouping, you use the agg function in PySpark to apply an aggregate function to each group.