Pandas Plot Histogram: Create and Customize Histograms in Python

Name: Rajiv Chandra

Published on 8/17/2023

Data visualization is a crucial aspect of data analysis and Python's Pandas library is a powerful tool that allows us to create insightful visualizations. One such visualization is a histogram, a graphical representation of the distribution of a dataset. In this article, we will explore how to plot a histogram using pandas, customize bins, plot multiple columns, and much more. We will also address some frequently asked questions and provide examples to help you understand the process better.

Histograms are particularly useful when dealing with large datasets, because they summarize thousands of values in a single picture. They show the underlying frequency distribution of a set of continuous or discrete data: where values cluster, how widely they spread, and whether the distribution is skewed. With data such as ages, for example, a histogram makes it immediately clear which age groups dominate.

Want to quickly create Data Visualization from Python Pandas Dataframe with No code?

PyGWalker is a Python library for Exploratory Data Analysis with Visualization. PyGWalker can simplify your Jupyter Notebook data analysis and data visualization workflow by turning your pandas dataframe (and polars dataframe) into a Tableau-style User Interface for visual exploration.

Creating a Histogram in Pandas

Creating a histogram in pandas is straightforward thanks to the hist() method, which is available on both DataFrame and Series objects and uses matplotlib for the drawing. It provides a quick way to visualize the distribution of data. Here's a basic example of how to create a histogram:

import pandas as pd
import matplotlib.pyplot as plt
 
# Create a simple dataframe
data = {'values': [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]}
df = pd.DataFrame(data)
 
# Plot a histogram
df['values'].hist()
plt.show()

In this example, we first import the necessary libraries, pandas and matplotlib. We then create a simple pandas DataFrame and use the hist() function to plot a histogram of the 'values' column. The plt.show() function is used to display the plot.
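As a small extension of the example above, hist() on a Series returns the matplotlib Axes it drew on, so the plot can be labeled before it is shown. The sketch below assumes the same toy data; the label and title strings are illustrative choices, not part of the original example.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'values': [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]})

# Series.hist() returns the matplotlib Axes object it drew on
ax = df['values'].hist(grid=False, edgecolor='black')

# Use the Axes to label the plot (strings are illustrative)
ax.set_xlabel('Value')
ax.set_ylabel('Count')
ax.set_title('Distribution of values')
plt.show()
```

Keeping a reference to the Axes is usually more convenient than calling plt functions, especially once a figure contains more than one plot.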

Customizing Bins in a Pandas Histogram

The hist() function in pandas uses a default number of bins, which is 10. However, you can customize the number of bins according to your needs. The bins parameter in the hist() function is used to specify the number of bins you want in your histogram.

For instance, if you want to increase the number of bins to 20, you can do so as follows:

df['values'].hist(bins=20)
plt.show()

Customizing bins in a pandas histogram can help you get a more detailed view of the data distribution. However, it's important to choose an appropriate number of bins. With too many bins, most bins hold only a few observations, so the histogram mostly shows random noise and the overall shape of the distribution gets lost. With too few bins, distinct features such as multiple peaks are merged together, making it hard to discern any useful patterns.
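Besides a bin count, the bins parameter also accepts a sequence of bin edges, which gives full control over where each bin starts and ends. The following sketch assumes the toy data from the first example and picks unit-width bins centred on the integers 1 to 4; the edge values are an illustrative choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'values': [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]})

# Explicit bin edges: one unit-width bin centred on each integer 1..4
edges = np.arange(0.5, 5.5, 1.0)  # 0.5, 1.5, 2.5, 3.5, 4.5

# np.histogram uses the same binning rules, handy for checking the counts
counts, _ = np.histogram(df['values'], bins=edges)
print(counts)  # [1 2 3 4]

# Pass the same edges to hist() to draw the histogram
ax = df['values'].hist(bins=edges)
plt.show()
```

With integer data, centring the bins on the integers avoids the empty gaps that appear when a large bin count such as 20 is applied to only four distinct values.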

Plotting a Histogram with Multiple Columns in Pandas

Pandas also allows you to plot histograms for multiple columns at once. This can be particularly useful when you want to compare the distribution of two different variables. To do so, call hist() on the DataFrame itself rather than on a single column; pandas then draws one histogram per numeric column, each in its own subplot.

Here's an example of how to plot a histogram with multiple columns:

# Create a dataframe with two columns
data = {'values1': [1, 2, 2, 3, 3, 3, 4, 4, 4, 4],
        'values2': [1, 1, 2, 2, 3, 3, 3, 4, 4, 4]}
df = pd.DataFrame(data)
 
# Plot a histogram with multiple columns
df.hist(bins=20, alpha=0.5)
plt.show()

In this example, we create a DataFrame with two columns, 'values1' and 'values2'. We then call the hist() function on the DataFrame, which plots a histogram for each column in its own subplot. The alpha parameter sets the transparency of the bars; because each column gets separate axes here, it only lightens the bars and matters most when histograms are overlaid on the same axes.
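If the goal is to compare the two columns directly, one option is to overlay them on a single set of axes. The sketch below uses DataFrame.plot.hist(), which draws all columns on one Axes with shared bins and adds a legend automatically; the bin count of 8 is an arbitrary choice for this toy data.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'values1': [1, 2, 2, 3, 3, 3, 4, 4, 4, 4],
    'values2': [1, 1, 2, 2, 3, 3, 3, 4, 4, 4],
})

# plot.hist() overlays every column on the same Axes using shared bins
ax = df.plot.hist(bins=8, alpha=0.5)
ax.set_xlabel('Value')

# The legend entries are taken from the column names
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
plt.show()
```

Here the alpha value does real work: where the two histograms overlap, both colors remain visible.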

Plotting a Histogram by Group in Pandas

Another powerful feature of pandas is the ability to plot a histogram by group. This can be particularly useful when you want to compare the distribution of a variable across different groups.

For instance, let's say we have a DataFrame that contains the ages of people in different professions. We can plot a histogram of ages by profession as follows:

# Create a dataframe with age and profession
data = {'age': [23, 25, 22, 30, 32, 40, 35, 24, 28, 35],
        'profession': ['engineer', 'doctor', 'engineer', 'doctor', 'engineer', 'doctor', 'engineer', 'doctor', 'engineer', 'doctor']}
df = pd.DataFrame(data)
 
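# Plot one histogram per profession on the same axes
for profession, ages in df.groupby('profession')['age']:
    ages.hist(alpha=0.6, label=profession.capitalize())
plt.legend()
plt.show()

In this example, we first create a DataFrame with 'age' and 'profession' columns. We then group the DataFrame by 'profession' and call hist() on the 'age' values of each group, which overlays one histogram per profession on the same axes. The alpha parameter sets the transparency so that overlapping bars stay visible. Each histogram is given its label directly, so legend() always pairs the right name with the right color; a hand-written list of names is easy to get wrong, because groupby iterates over the groups in sorted key order ('doctor' before 'engineer').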
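When overlapping histograms become hard to read, an alternative is to give each group its own subplot. A minimal sketch, assuming the same age and profession data: DataFrame.hist() accepts a by argument that splits the chosen column by group, and each subplot is titled with its group key. The bin count of 5 is an arbitrary choice.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'age': [23, 25, 22, 30, 32, 40, 35, 24, 28, 35],
    'profession': ['engineer', 'doctor', 'engineer', 'doctor', 'engineer',
                   'doctor', 'engineer', 'doctor', 'engineer', 'doctor'],
})

# One subplot per profession; returns an array of matplotlib Axes
axes = df.hist(column='age', by='profession', bins=5)

# Each subplot title is the group key it shows
titles = [ax.get_title() for ax in axes.ravel()]
print(titles)
plt.show()
```

By default each subplot gets its own axis range, so passing sharex=True and sharey=True can help when the groups need to be compared on the same scale.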

Plotting a Normalized Histogram in Pandas

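Sometimes, it's useful to plot a normalized histogram, which shows the distribution on a relative scale rather than as raw counts. This can be achieved in pandas by setting the density parameter to True in the hist() function. Each bar height then equals the bin's count divided by the total number of observations and by the bin width, so the heights are densities rather than plain proportions.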

Here's an example of how to plot a normalized histogram:

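# Recreate the two-column data (df was reassigned in the group example)
df = pd.DataFrame({'values1': [1, 2, 2, 3, 3, 3, 4, 4, 4, 4], 'values2': [1, 1, 2, 2, 3, 3, 3, 4, 4, 4]})
# Plot a normalized histogram
df['values1'].hist(density=True)
plt.show()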

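In this example, the density=True argument scales the bars so that their total area equals 1, which makes the histogram an estimate of the probability density function. Note that individual bar heights can exceed 1 when the bins are narrower than one unit.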
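The difference between densities and proportions can be checked numerically. The sketch below uses np.histogram, which applies the same normalization as matplotlib, on the 'values1' data. The weights approach for plotting true proportions is a swapped-in technique, not something the example above uses.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 4, 4])

# density=True: height = count / (n * bin_width), so the total AREA is 1
density, edges = np.histogram(values, bins=10, density=True)
area = (density * np.diff(edges)).sum()

# Proportions instead: weight each observation by 1/n so the HEIGHTS sum to 1
weights = np.ones(len(values)) / len(values)
proportions, _ = np.histogram(values, bins=10, weights=weights)

print(round(area, 6), round(proportions.sum(), 6))
print(density.max() > 1)  # densities may exceed 1 for narrow bins
```

To draw the proportions version, the same weights array can be passed through hist(), for example values.hist(bins=10, weights=weights), since extra keyword arguments are forwarded to matplotlib.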

Creating Subplots with Pandas Histogram

Pandas also creates subplots for you when plotting histograms. Calling hist() on a DataFrame already draws one subplot per column, which is useful when you want to compare the distributions of multiple variables side by side, and the layout parameter controls how those subplots are arranged. Note that DataFrame.hist() has no subplots argument; subplots=True belongs to the more general df.plot.hist() interface, and passing it to hist() raises an error.

Here's an example:

# One subplot per column, arranged in a single row
df.hist(bins=20, alpha=0.5, layout=(1, 2))
plt.show()

In this example, we create two subplots in a single row for the 'values1' and 'values2' columns. The layout parameter takes a (rows, columns) tuple that specifies the arrangement of the subplots.
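For finer control over the figure, the subplots can also be created with matplotlib first and then handed to pandas through the ax parameter of hist(). This is a sketch under the assumption that the two-column toy data is used; the figure size, bin count, and shared y-axis are illustrative choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'values1': [1, 2, 2, 3, 3, 3, 4, 4, 4, 4],
    'values2': [1, 1, 2, 2, 3, 3, 3, 4, 4, 4],
})

# Create one row with two Axes that share the y-axis for easy comparison
fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)

# Draw each column on its own Axes and title it with the column name
for ax, column in zip(axes, df.columns):
    df[column].hist(ax=ax, bins=8, alpha=0.5)
    ax.set_title(column)

fig.tight_layout()
plt.show()
```

Building the grid explicitly makes it straightforward to mix histograms with other plot types in the same figure.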

Adding Error Bars to a Pandas Histogram

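Adding error bars to a histogram can provide a visual representation of the variability or uncertainty in the data. Pandas does not directly support error bars on histograms, but they can be drawn with the matplotlib library. The simplest variant, shown here, draws a single horizontal bar that marks the mean of the data plus and minus one standard deviation.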

Here's an example:

# Calculate mean and standard deviation
mean = df['values1'].mean()
std = df['values1'].std()
 
# Plot the histogram, then mark the mean with a bar of +/- one std
plt.hist(df['values1'], bins=20, alpha=0.5)
# y=5 is an arbitrary height just above the tallest bar (count 4)
plt.errorbar(mean, 5, xerr=std, fmt='o', capsize=4)
plt.show()

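In this example, we first calculate the mean and standard deviation of the 'values1' column. We then plot the histogram and use the errorbar() function from matplotlib to place a marker at the mean, with a horizontal bar (xerr) extending one standard deviation to either side. The y position of the marker carries no meaning; it is only chosen so that the bar sits above the histogram.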
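If an error bar on every bin is wanted instead, one common convention is to treat each bin count as a Poisson count and use its square root as the uncertainty. That convention is an assumption introduced here, not something the example above uses. A minimal sketch with the 'values1' data and four bins:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 4, 4])

# Compute the bin counts and the centre of each bin
counts, edges = np.histogram(values, bins=4)
centers = (edges[:-1] + edges[1:]) / 2

# Assumption: Poisson counting error, i.e. sqrt(count) per bin
errors = np.sqrt(counts)

# Draw the bars manually so each one can carry its own vertical error bar
fig, ax = plt.subplots()
ax.bar(centers, counts, width=np.diff(edges), alpha=0.5,
       edgecolor='black', yerr=errors, capsize=4)
ax.set_xlabel('Value')
ax.set_ylabel('Count')
plt.show()
```

Whether the square-root rule is appropriate depends on how the data was collected, so it should be read as one possible choice rather than a general recipe.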

Conclusion

Histograms are a powerful tool for data visualization, and the pandas library in Python provides a versatile function to create and customize histograms. Whether you're plotting a simple histogram, customizing bins, plotting multiple columns, or creating subplots, pandas has got you covered. Remember, the key to effective data visualization is not only creating insightful plots but also customizing them to suit your specific needs.

FAQs

How can I customize the x-axis ticks in a pandas histogram? You can customize the x-axis ticks using the xticks() function from the matplotlib library. For example, plt.xticks(range(0, 10)) will place ticks at the integers 0 through 9 (the end value of range() is excluded).
How can I plot a histogram with density in pandas? You can plot a histogram with density by setting the density parameter to True in the hist() function. This will plot a normalized histogram where the area under the histogram will sum up to 1.
How can I add a legend to my pandas histogram? You can add a legend to your pandas histogram using the legend() function from the matplotlib library. For example, plt.legend(['Column1', 'Column2']) will add a legend with 'Column1' and 'Column2'.
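The tick and legend answers above can be combined in one short sketch. It assumes the 'values1' data from earlier and uses the legend=True option of hist(), which labels the histogram with the Series name; the tick range 0 to 5 is an arbitrary choice.

```python
import pandas as pd
import matplotlib.pyplot as plt

values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 4, 4], name='values1')

# legend=True adds a legend entry named after the Series
ax = values.hist(bins=8, legend=True)

# Place ticks at the integers 0..5 (range excludes its end value)
ax.set_xticks(range(0, 6))

ticks = list(ax.get_xticks())
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(ticks, legend_labels)
plt.show()
```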
