How to Merge, Join and Concat Pandas DataFrames in Python

Name: Viktor Zinchenko

Published on 3/10/2023

Learn how to merge pandas DataFrames in Python with this step-by-step guide. It covers inner and outer joins, merging on specific columns, and creating data visualizations from pandas DataFrames with PyGWalker.

Merging, joining, and concatenating DataFrames in pandas are important techniques that allow you to combine multiple datasets into one. These techniques are essential for cleaning, transforming, and analyzing data. Merging, joining, and concatenating are often used interchangeably, but they refer to different methods of combining data. In this post, we will discuss these three important techniques in detail and provide examples of how to use them in Python.

Merging DataFrames in Pandas

Merging is the process of combining two DataFrames into a single DataFrame by linking rows based on one or more common keys. The common keys are one or more columns that hold matching values in both DataFrames. pd.merge() works on two DataFrames at a time, so combining more than two means chaining successive merges.
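To illustrate linking rows on more than one key, here is a minimal sketch that merges on two columns at once. The frames and column names (store, year, sales, target) are hypothetical sample data, not taken from the examples in this post.

```python
import pandas as pd

# Hypothetical sample data: sales and targets keyed by store and year
sales = pd.DataFrame({'store': ['S1', 'S1', 'S2'],
                      'year': [2022, 2023, 2023],
                      'sales': [100, 120, 90]})
targets = pd.DataFrame({'store': ['S1', 'S2', 'S2'],
                        'year': [2023, 2022, 2023],
                        'target': [110, 80, 95]})

# Passing a list to `on` links rows only where both 'store' and 'year' match
merged_multi = pd.merge(sales, targets, on=['store', 'year'])

print(merged_multi)
```

Only the (S1, 2023) and (S2, 2023) rows appear in the result, because the default merge type is inner and those are the only key pairs present in both frames.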

Different Types of Merges

The four most commonly used merge types in pandas are inner, outer, left, and right. You select one with the how parameter of pd.merge(); inner is the default.

Inner Merge: Returns only the rows that have matching values in both DataFrames.
Outer Merge: Returns all the rows from both DataFrames and fills in the missing values with NaN where there is no match.
Left Merge: Returns all the rows from the left DataFrame and the matching rows from the right DataFrame. Fills in the missing values with NaN where there is no match.
Right Merge: Returns all the rows from the right DataFrame and the matching rows from the left DataFrame. Fills in the missing values with NaN where there is no match.

Examples of How to Perform Different Types of Merges

Let's look at some examples of how to perform different types of merges using Pandas.

Example 1: Inner Merge

import pandas as pd

# Creating two DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                    'value': [5, 6, 7, 8]})

# Inner merge (how='inner' is the default)
merged_inner = pd.merge(df1, df2, on='key')

print(merged_inner)

Output:

  key  value_x  value_y
0   B        2        5
1   D        4        6

Example 2: Outer Merge

import pandas as pd

# Creating two DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                    'value': [5, 6, 7, 8]})

# Outer merge
merged_outer = pd.merge(df1, df2, on='key', how='outer')

print(merged_outer)

Output:

  key  value_x  value_y
0   A      1.0      NaN
1   B      2.0      5.0
2   C      3.0      NaN
3   D      4.0      6.0
4   E      NaN      7.0
5   F      NaN      8.0

Example 3: Left Merge

A left merge returns all the rows from the left DataFrame and the matched rows from the right DataFrame. Any rows from the left DataFrame that do not have a match in the right DataFrame will have NaN values in the columns of the right DataFrame.

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E'], 'value': [5, 6, 7]})

# Perform a left merge
left_merged_df = pd.merge(df1, df2, on='key', how='left')

# Print the merged DataFrame
print(left_merged_df)

Output:

  key  value_x  value_y
0   A        1      NaN
1   B        2      5.0
2   C        3      NaN
3   D        4      6.0

Example 4: Right Merge

A right merge returns all the rows from the right DataFrame and the matched rows from the left DataFrame. Any rows from the right DataFrame that do not have a match in the left DataFrame will have NaN values in the columns of the left DataFrame.

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E'], 'value': [5, 6, 7]})

# Perform a right merge
right_merged_df = pd.merge(df1, df2, on='key', how='right')

# Print the merged DataFrame
print(right_merged_df)

Output:

  key  value_x  value_y
0   B      2.0        5
1   D      NaN        6
2   E      NaN        7
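The examples above merge on a key column that has the same name in both DataFrames. When the names differ, one option is to use the left_on and right_on parameters of pd.merge(). The sketch below also passes indicator=True, which adds a _merge column showing whether each row came from the left frame, the right frame, or both. The frames and column names here are hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample data: the key column is named differently in each frame
employees = pd.DataFrame({'emp_id': [1, 2, 3],
                          'name': ['Ann', 'Bob', 'Cid']})
salaries = pd.DataFrame({'employee': [2, 3, 4],
                         'salary': [50, 60, 70]})

# left_on/right_on name the key column on each side;
# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only' or 'both'
merged = pd.merge(employees, salaries,
                  left_on='emp_id', right_on='employee',
                  how='outer', indicator=True)

print(merged)
```

Both key columns (emp_id and employee) are kept in the result, so you may want to drop one of them afterwards.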

Joining DataFrames in pandas

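Joining combines two DataFrames into one based on their index or on a key column. In pandas this is done with the DataFrame.join() method, which aligns rows on the index by default and performs a left join unless the how parameter says otherwise.

The four most commonly used join types in pandas are inner, outer, left, and right.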

Inner Join: Returns only the rows that have matching index or column values in both DataFrames.
Outer Join: Returns all the rows from both DataFrames and fills in the missing values with NaN where there is no match.
Left Join: Returns all the rows from the left DataFrame and the matching rows from the right DataFrame. Fills in the missing values with NaN where there is no match.
Right Join: Returns all the rows from the right DataFrame and the matching rows from the left DataFrame. Fills in the missing values with NaN where there is no match.
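As a minimal sketch of these join types, the following example uses DataFrame.join() on two frames that share index labels. The frames and the column names value_left and value_right are hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample data, indexed by a shared key
left = pd.DataFrame({'value_left': [1, 2, 3]}, index=['A', 'B', 'C'])
right = pd.DataFrame({'value_right': [5, 6]}, index=['B', 'D'])

# join() aligns on the index and defaults to a left join:
# every row of `left` is kept, unmatched rows get NaN
joined_left = left.join(right)

# The how parameter selects another join type
joined_inner = left.join(right, how='inner')

print(joined_left)
print(joined_inner)
```

The left join keeps the labels A, B, and C, while the inner join keeps only B, the one label present in both frames.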

Concatenating DataFrames in pandas

Concatenating is the process of stacking two or more DataFrames either vertically (adding rows) or horizontally (adding columns). In pandas, this is done with the concat() function, which takes a list of DataFrames and an axis parameter: axis=0 (the default) stacks vertically and axis=1 stacks horizontally.

Examples of how to concatenate two or more DataFrames using pandas

To concatenate two or more DataFrames vertically, you can use the following code:

import pandas as pd

# Create two sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']})

# Concatenate the DataFrames vertically
result = pd.concat([df1, df2])

print(result)

Output:

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
0  A4  B4  C4  D4
1  A5  B5  C5  D5
2  A6  B6  C6  D6
3  A7  B7  C7  D7
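Note that the output above repeats the index labels 0 to 3, because concat() keeps each frame's original index. Two common ways to deal with this are sketched below, using smaller hypothetical frames and the hypothetical labels 'first' and 'second'.

```python
import pandas as pd

# Smaller hypothetical frames with the same columns
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# ignore_index=True discards the original labels and renumbers the rows
renumbered = pd.concat([df1, df2], ignore_index=True)

# keys adds an outer index level recording which frame each row came from
labelled = pd.concat([df1, df2], keys=['first', 'second'])

print(renumbered)
print(labelled)
```

With keys, the rows of one source frame can be selected again with labelled.loc['second'].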

To concatenate two or more DataFrames horizontally, you can use the following code:

import pandas as pd

# Create two sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

df2 = pd.DataFrame({'E': ['E0', 'E1', 'E2', 'E3'],
                    'F': ['F0', 'F1', 'F2', 'F3'],
                    'G': ['G0', 'G1', 'G2', 'G3'],
                    'H': ['H0', 'H1', 'H2', 'H3']})

# Concatenate the DataFrames horizontally
result = pd.concat([df1, df2], axis=1)

print(result)

Output:

    A   B   C   D   E   F   G   H
0  A0  B0  C0  D0  E0  F0  G0  H0
1  A1  B1  C1  D1  E1  F1  G1  H1
2  A2  B2  C2  D2  E2  F2  G2  H2
3  A3  B3  C3  D3  E3  F3  G3  H3
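The two examples above use frames whose columns (vertical case) or index labels (horizontal case) line up exactly. As a minimal sketch of what happens when they do not, the following example concatenates two hypothetical frames that share only some columns and compares the join='outer' default with join='inner'.

```python
import pandas as pd

# Hypothetical sample data: the frames share only columns 'A' and 'B'
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1'], 'C': ['C0', 'C1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3'], 'D': ['D2', 'D3']})

# The default join='outer' keeps every column and fills the gaps with NaN
outer_result = pd.concat([df1, df2], ignore_index=True)

# join='inner' keeps only the columns present in every frame
inner_result = pd.concat([df1, df2], join='inner', ignore_index=True)

print(outer_result)
print(inner_result)
```

The same alignment rule applies with axis=1, where rows are matched on index labels instead of column names.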

Create a Concat View for Pandas DataFrames

To create Concat Views in Python, you can use PyGWalker, an open-source data analysis and data visualization package.

Create Concat Views with Pandas DataFrames

PyGWalker can simplify your Jupyter Notebook data analysis and data visualization workflow by providing a lightweight, easy-to-use interface as an alternative to writing analysis code in Python. The steps are easy:

Import pygwalker and pandas to your Jupyter Notebook to get started.

import pandas as pd
import pygwalker as pyg

You can use pygwalker without changing your existing workflow. For example, you can call up Graphic Walker with the dataframe loaded in this way:

df = pd.read_csv('./bike_sharing_dc.csv', parse_dates=['date'])
gwalker = pyg.walk(df)

Now you can visualize your Pandas Dataframe with a user-friendly UI!

You can create a Concat View simply by dragging and dropping variables.

To test out PyGWalker right now, you can run it in Google Colab, Binder, or Kaggle.

PyGWalker is open source. You can check out the PyGWalker GitHub page and read the Towards Data Science article about it.

Don't forget to check out a more advanced, AI-powered automated data analysis tool: RATH. RATH is also open source, with its source code hosted on GitHub.

FAQ

How can I join two DataFrames using PySpark?

PySpark is the Python API for Apache Spark, an open-source big data processing framework that also offers APIs in Java, Scala, and R. To join two DataFrames in PySpark, call the join() method on one DataFrame and pass it the other DataFrame, the column or join expression to join on, and optionally the type of join through the how parameter.

# import PySpark library and create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("JoinExample").getOrCreate()

# create two DataFrames
df1 = spark.createDataFrame([(1, 'A'), (2, 'B'), (3, 'C')], ['id', 'letter'])
df2 = spark.createDataFrame([(1, 'X'), (2, 'Y'), (3, 'Z')], ['id', 'symbol'])

# join the two DataFrames using PySpark
joined_df = df1.join(df2, 'id', 'inner')

# show the resulting DataFrame
joined_df.show()

How can I merge two DataFrames using R?

To merge two DataFrames using R, you can use the merge() function, which takes two data frames and an optional set of arguments that specify how the data should be merged.

# create two data frames
df1 <- data.frame(id = c(1, 2, 3), letter = c("A", "B", "C"))
df2 <- data.frame(id = c(1, 2, 4), symbol = c("X", "Y", "Z"))

# merge the two data frames using R
merged_df <- merge(df1, df2, by = "id", all = TRUE)

# show the resulting data frame
print(merged_df)

How can I append two or more DataFrames in pandas?

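To append two or more DataFrames in pandas, you can use the concat() function, which takes a list of DataFrames and an optional axis parameter that specifies the axis along which the DataFrames should be concatenated. Passing ignore_index=True renumbers the rows of the result. The older DataFrame.append() method was removed in pandas 2.0, so concat() is the function to use.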

# import pandas library
import pandas as pd

# create two DataFrames
df1 = pd.DataFrame({'id': [1, 2, 3], 'letter': ['A', 'B', 'C']})
df2 = pd.DataFrame({'id': [4, 5, 6], 'letter': ['D', 'E', 'F']})

# append the two DataFrames using pandas
appended_df = pd.concat([df1, df2], ignore_index=True)

# show the resulting DataFrame
print(appended_df)

How can I join two DataFrames based on a common column using pandas?

To join two DataFrames based on a common column using pandas, you can use the merge() function, which takes two DataFrames and an optional set of arguments that specify how the data should be merged. You can specify the column to join using the on parameter.

# import pandas library
import pandas as pd

# create two DataFrames
df1 = pd.DataFrame({'id': [1, 2, 3], 'letter': ['A', 'B', 'C']})
df2 = pd.DataFrame({'id': [1, 2, 4], 'symbol': ['X', 'Y', 'Z']})

# join the two DataFrames using pandas
joined_df = pd.merge(df1, df2, on='id', how='inner')

# show the resulting DataFrame
print(joined_df)

Conclusion

In conclusion, merging, joining, and concatenating DataFrames are essential operations in data analysis. With the help of powerful tools like pandas, PySpark, and R, these operations can be performed easily and efficiently. Whether you are dealing with large or small datasets, these tools offer flexible and intuitive ways to manipulate your data.
