How to Effectively Use Pandas Get Dummies Function

Name: Oluwaseun Adeojo

Published on 8/19/2023

Python's Pandas library offers many functions for data manipulation, and get_dummies is one of the most useful for preparing categorical data. This tutorial explains how the function works and how to use it effectively in your data preprocessing tasks.

Want to quickly create Data Visualizations in Python?

PyGWalker is an open-source Python project that can help speed up data analysis and visualization workflows directly within Jupyter Notebook-based environments.

PyGWalker turns your Pandas DataFrame (or Polars DataFrame) into a visual UI where you can drag and drop variables to create graphs with ease. Simply use the following code:

pip install pygwalker
import pygwalker as pyg
gwalker = pyg.walk(df)


Understanding Pandas Get Dummies

Pandas' get_dummies function is the standard tool for one-hot encoding categorical data. It converts each categorical variable into dummy (indicator) variables: one new column per unique category, with 1 (or True) marking the rows where that category is present and 0 (or False) marking the rows where it is absent.

Why does this matter? Machine learning algorithms typically work with numerical data. Hence, categorical data often needs to be transformed into a numerical format, which is where get_dummies comes into play.

import pandas as pd

# Example data
data = pd.DataFrame({'pets': ['cat', 'dog', 'bird', 'cat']})

# Applying get_dummies
# dtype=int yields 1/0 columns (the default dtype is bool since pandas 2.0)
dummies_data = pd.get_dummies(data, dtype=int)

print(dummies_data)

This will output:

   pets_bird  pets_cat  pets_dog
0          0         1         0
1          0         0         1
2          1         0         0
3          0         1         0
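
To make the encoding concrete, here is a minimal sketch that checks two properties of the result: every row has exactly one indicator set, and the column holding that indicator maps back to the original category. The variable names `row_sums` and `recovered` are illustrative only.

```python
import pandas as pd

data = pd.DataFrame({'pets': ['cat', 'dog', 'bird', 'cat']})

# dtype=int keeps the 1/0 representation regardless of pandas version
dummies = pd.get_dummies(data, dtype=int)

# One column per unique category, named <column>_<category>, in sorted order
print(dummies.columns.tolist())

# Exactly one indicator is set in each row
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())

# The column that holds the 1 identifies the original category
recovered = dummies.idxmax(axis=1).str.replace('pets_', '', regex=False)
print(recovered.tolist())
```

This round trip only works when there are no missing values, because a missing value produces a row with no indicator set.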

The Anatomy of Get Dummies Function

The get_dummies function has several parameters to allow for granular control over its operation. Here's a brief overview of these parameters:

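data: The input Series, DataFrame, or array-like from which to generate dummy variables.
prefix: Optional string, list of strings, or dict used as the prefix for the dummy column names. For a DataFrame, the default prefix is the source column name.
prefix_sep: Separator placed between the prefix and the category name. Default is "_".
dummy_na: If True, add a column that flags NaN values; if False, NaNs are ignored. Default is False.
columns: Optional list of column names to convert into dummy variables. If not specified, all columns with object, string, or category dtype are converted.
sparse: If True, the dummy columns are backed by a SparseArray instead of a regular NumPy array, which saves memory when there are many categories. Default is False.
dtype: Data type of the new columns. The default is bool since pandas 2.0; earlier versions returned 0/1 integers (uint8).
drop_first: If True, produce k-1 dummies for k categorical levels by dropping the first level. This avoids perfect multicollinearity among the dummy columns (the "dummy variable trap"), which matters for models such as linear regression. Default is False.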
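
As a small sketch of how two of these parameters change the result, the example below compares the default encoding with one that uses drop_first and a custom prefix_sep. The separator "=" and the variable names are arbitrary choices for illustration.

```python
import pandas as pd

data = pd.DataFrame({'pets': ['cat', 'dog', 'bird', 'cat']})

full = pd.get_dummies(data, dtype=int)
reduced = pd.get_dummies(data, drop_first=True, prefix_sep='=', dtype=int)

print(full.columns.tolist())     # one column per category
print(reduced.columns.tolist())  # 'bird', the first level in sorted order, is dropped

# In the reduced encoding, a row of all zeros stands for the dropped level
is_bird = reduced.sum(axis=1) == 0
print(is_bird.tolist())
```

Because the dropped level is encoded implicitly, drop_first loses no information as long as the column has no missing values.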

Practical Examples of Pandas Get Dummies

Let's delve into more practical examples of using the get_dummies function.

1. Using the prefix parameter

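By default, each dummy column of a DataFrame is prefixed with the name of its source column (here, 'pets'). You can use the prefix parameter to choose a different prefix, which can make the source of these columns easier to identify later on. The example below replaces the default with 'animal', producing columns such as animal_cat.

# Apply get_dummies with a custom prefix
dummies_data_prefix = pd.get_dummies(data, prefix='animal')

print(dummies_data_prefix)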
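
The prefix parameter behaves slightly differently depending on the input. The sketch below, which uses illustrative names and prefixes, shows a Series input (no prefix by default) and a DataFrame where a dict assigns a separate prefix to each encoded column.

```python
import pandas as pd

pets = pd.Series(['cat', 'dog', 'bird', 'cat'])

# For a Series input, no prefix is added by default
plain = pd.get_dummies(pets, dtype=int)
print(plain.columns.tolist())

# Supplying a prefix labels where the columns came from
labeled = pd.get_dummies(pets, prefix='pet', dtype=int)
print(labeled.columns.tolist())

# For a DataFrame, a dict maps each encoded column to its own prefix
df = pd.DataFrame({'pets': ['cat', 'dog'], 'color': ['black', 'white']})
renamed = pd.get_dummies(df, prefix={'pets': 'p', 'color': 'c'}, dtype=int)
print(renamed.columns.tolist())
```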

2. Dealing with NaN values

When dealing with real-world data, you will often encounter missing values. Using the dummy_na parameter, you can create a separate dummy column for NaN values.

# Example data with NaN
data = pd.DataFrame({'pets': ['cat', 'dog', 'bird', None]})
 
# Apply get_dummies with dummy_na
dummies_data_nan = pd.get_dummies(data, dummy_na=True)
 
print(dummies_data_nan)
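
To see what dummy_na changes, the following sketch encodes the same data with and without it. With the default setting the missing row simply has no indicator set; with dummy_na=True an extra pets_nan column flags it.

```python
import pandas as pd

data = pd.DataFrame({'pets': ['cat', 'dog', 'bird', None]})

without_na = pd.get_dummies(data, dtype=int)
with_na = pd.get_dummies(data, dummy_na=True, dtype=int)

# Default: the row with the missing value is all zeros
print(without_na.iloc[3].tolist())

# dummy_na=True: an extra column marks the missing value
print(with_na.columns.tolist())
print(with_na['pets_nan'].tolist())
```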

3. Working with multiple columns

The get_dummies function can encode several columns at once. Pass their names to the columns parameter; any column not listed is left unchanged. In the example below, we create dummy variables for two categorical columns, 'pets' and 'color'.

# Example data with multiple columns
data = pd.DataFrame({'pets': ['cat', 'dog', 'bird', 'cat'], 'color': ['black', 'white', 'black', 'white']})
 
# Apply get_dummies to multiple columns
dummies_data_multi = pd.get_dummies(data, columns=['pets', 'color'])
 
print(dummies_data_multi)
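
Real datasets usually mix categorical and numeric columns. The sketch below adds a hypothetical numeric 'age' column (not part of the original example) to show which columns are encoded and which are passed through, both with and without the columns parameter.

```python
import pandas as pd

data = pd.DataFrame({
    'pets': ['cat', 'dog', 'bird', 'cat'],
    'color': ['black', 'white', 'black', 'white'],
    'age': [3, 5, 1, 2],  # hypothetical numeric column for illustration
})

# Encode only 'pets'; 'color' and 'age' are carried over unchanged
only_pets = pd.get_dummies(data, columns=['pets'], dtype=int)
print(only_pets.columns.tolist())

# Without columns, every object-dtype column is encoded and 'age' is kept as-is
all_categorical = pd.get_dummies(data, dtype=int)
print(all_categorical.columns.tolist())
```

In both cases the untouched columns come first in the result, followed by the new dummy columns.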

Conclusion

The pd.get_dummies() function converts categorical columns into indicator columns that machine learning algorithms can consume. Its parameters, such as prefix, dummy_na, columns, and drop_first, let you control how the new columns are named, how missing values are handled, which columns are encoded, and how many indicators are produced.
