How to merge DataFrames with pandas merge()
With the pandas function DataFrame.merge()
, you can combine DataFrames using keys. This allows you to efficiently combine data from various sources, enabling you to carry out more comprehensive analyses.
- 99.9% uptime and super-fast loading
- Advanced security features
- Domain and email included
What is the syntax for pandas merge()
?
The Python pandas DataFrame merge()
method can accept a range of different parameters, allowing developers to specify how DataFrames should be combined. The general syntax of the merge()
function is as follows:
DataFrame.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
pythonThe pandas merge()
function is similar to the JOIN operation in SQL. Having experience with relational databases like SQL can make it easier to understand how the pandas DataFrame merge()
method words. Keep in mind, though, that there are some differences. In pandas, if both key columns have values that evaluate to null, those values will also be merged.
What parameters can be used with pandas merge
?
The various parameters accepted by merge()
allow you to not only specify which pandas DataFrames to combine but also which type of join to use as well as other details.
Parameter | Description | Default Value |
---|---|---|
left | The first DataFrame to be merged | |
right | The second DataFrame to be merged | |
how | The type of join operation to perform (inner , outer , left , right )
|
inner
|
on | The column(s) or index level(s) to use as keys; must be present in both DataFrames | |
left_on | The column(s) or index level(s) of the left DataFrame that should be used as the key(s) | |
right_on | The column(s) or index level(s) of the right DataFrame that should be used as the key(s) | |
left_index | If True , the index of the left DataFrame is used as the key
|
False
|
right_index | If True , the index of the right DataFrame is used as the key
|
False
|
sort | If True , the resulting DataFrame’s keys are sorted lexicographically
|
False
|
suffixes | Suffixes used to distinguish columns with the same name | ("_x", "_y")
|
copy | If False , a copy is avoided
|
True
|
How to use pandas merge()
Below are some examples to help illustrate how to use pandas merge()
effectively.
INNER JOIN
An INNER JOIN combines two pandas DataFrames and only returns the rows where the keys match in both DataFrames. To get a better idea of how this works, let’s create two DataFrames:
import pandas as pd
# Sample DataFrames
df1 = pd.DataFrame({
'Key': ['A', 'B', 'C'],
'Value1': [1, 2, 3]
})
df2 = pd.DataFrame({
'Key': ['B', 'C', 'D'],
'Value2': [4, 5, 6]
})
print(df1)
print(df2)
pythonThe two resulting DataFrames look like this:
Key Value1
0 A 1
1 B 2
2 C 3
Key Value2
0 B 4
1 C 5
2 D 6
Now, we can perform an INNER JOIN
using the merge()
function:
# INNER JOIN
result = pd.merge(df1, df2, how='inner', on='Key')
print(result)
pythonThe output shows that only the rows that have keys B and C are included in the DataFrame. This is because these two keys exist in both of the original DataFrames.
Key Value1 Value2
0 B 2 4
1 C 3 5
OUTER JOIN
An OUTER JOIN
also merges two DataFrames, but unlike INNER JOIN
, it returns all the rows and fills in missing values with NaN
.
# OUTER JOIN
result = pd.merge(df1, df2, how='outer', on='Key')
print(result)
pythonThe DataFrame below includes all the rows from both DataFrames. NaN
is used for the missing values in key A, which is only in df1
and key D, which is only in df2
.
Key Value1 Value2
0 A 1.0 NaN
1 B 2.0 4.0
2 C 3.0 5.0
3 D NaN 6.0
The other standard variants of JOIN
work in a similar manner.
Using left_on
and right_on
Sometimes, two DataFrames have different column names for their keys. In this case, you can use the left_on
and right_on
parameters to specify which columns to use. First, let’s create two new DataFrames:
df3 = pd.DataFrame({
'Key': ['A', 'B', 'C'],
'Value1': [1, 2, 3]
})
df4 = pd.DataFrame({
'Key2': ['B', 'C', 'D'],
'Value2': [4, 5, 6]
})
print(df3)
print(df4)
pythonThe two DataFrames look like this:
Key Value1
0 A 1
1 B 2
2 C 3
Key2 Value2
0 B 4
1 C 5
2 D 6
We can use the left_on
and right_on
parameters to perform the JOIN
operation using different keys:
# Join with different key column names
result = pd.merge(df3, df4, how='inner', left_on='Key', right_on='Key2')
print(result)
pythonBy explicitly using left_on='Key'
and right_on='Key2'
, the corresponding key columns are utilised for the merge.
Key Value1 Key2 Value2
0 B 2 B 4
1 C 3 C 5
Using indices as keys
You can also use the indices of DataFrames as keys by setting the left_index
and right_index
parameters to True
. First, let’s create two new DataFrames with indices:
df5 = pd.DataFrame({
'Value1': [1, 2, 3]
}, index=['A', 'B', 'C'])
df6 = pd.DataFrame({
'Value2': [4, 5, 6]
}, index=['B', 'C', 'D'])
print(df5)
print(df6)
pythonHere are the DataFrames:
Value1
A 1
B 2
C 3
Value2
B 4
C 5
D 6
Now, we can perform a JOIN
operation using the indices:
# JOIN with indices
result = pd.merge(df5, df6, how='inner', left_index=True, right_index=True)
print(result)
pythonThe result is a JOIN
that uses the indices from the DataFrames:
Value1 Value2
B 2 4
C 3 5