How to Perform Inner Joins in Python

Have you ever worked on a project that required data from multiple sources? You may have found yourself in a situation where you needed to combine data from two or more tables to create a complete dataset. In the world of programming, this process is called joining tables. One of the most common types of joins is the inner join. In this article, we will discuss how to perform inner joins in Python.

Before we dive into inner joins, let’s first understand what a join is. A join is a way of combining two or more tables based on a related column between them. This related column is called the key. When you join tables, you create a new table that contains all of the columns from both tables.

An inner join only includes rows that have matching values in both tables. In other words, an inner join returns only the rows where the key values are present in both tables.

Let’s say we have two tables: Orders and Customers. Both tables have a common column named "CustomerID", which we can use as the key to perform an inner join.

To perform an inner join in Python, we will use the Pandas library. Pandas is a powerful library for data manipulation and analysis. You can install Pandas by running the following command in your terminal:

pip install pandas

Once you have Pandas installed, you can start by importing the library into your Python script:

import pandas as pd

Now, let’s create two sample tables: Orders and Customers:

orders = pd.DataFrame({
    'OrderID': [1, 2, 3, 4, 5],
    'CustomerID': [1, 1, 2, 3, 3],
    'OrderDate': ['2022-08-01', '2022-08-02', '2022-08-03', '2022-08-04', '2022-08-05'],
    'Amount': [100, 150, 200, 50, 75]
})

customers = pd.DataFrame({
    'CustomerID': [1, 2, 3],
    'CustomerName': ['Alice', 'Bob', 'Charlie'],
    'City': ['New York', 'London', 'Paris']
})

Our Orders table has columns for OrderID, CustomerID, OrderDate, and Amount. The Customers table has columns for CustomerID, CustomerName, and City.

To perform an inner join, we will use the merge() function from Pandas:

merged = pd.merge(orders, customers, on='CustomerID', how='inner')

The merge() function requires two arguments: the two tables that you want to join and the column that you want to join on. In our case, we want to join the Orders and Customers tables on the CustomerID column.

The how argument specifies the type of join that you want to perform. In this case, we want to perform an inner join, so we specify ‘inner’.

The result of the inner join will be a new table that contains all of the columns from both tables, but only the rows where the CustomerID values match in both tables.

Let’s take a closer look at the merged table:

print(merged)

Output:

   OrderID  CustomerID   OrderDate  Amount CustomerName      City
0        1           1  2022-08-01     100        Alice  New York
1        2           1  2022-08-02     150        Alice  New York
2        3           2  2022-08-03     200          Bob    London
3        4           3  2022-08-04      50      Charlie     Paris
4        5           3  2022-08-05      75      Charlie     Paris

As you can see, the merged table contains all of the columns from both tables, but only the rows where the CustomerID values match in both tables.

One thing to note is that the resulting merged table may contain duplicate columns. In the example above, both the Orders and Customers tables have a column named CustomerID. When we join the two tables, the resulting merged table will have two columns named CustomerID.

To avoid duplicate columns, we can specify a suffix for the columns from the second table:

merged = pd.merge(orders, customers, on='CustomerID', how='inner', suffixes=('_order', '_customer'))

This will append ‘_order’ to the column names from the Orders table and ‘_customer’ to the column names from the Customers table.

print(merged)

Output:

   OrderID  CustomerID   OrderDate  Amount CustomerName      City
0        1           1  2022-08-01     100        Alice  New York
1        2           1  2022-08-02     150        Alice  New York
2        3           2  2022-08-03     200          Bob    London
3        4           3  2022-08-04      50      Charlie     Paris
4        5           3  2022-08-05      75      Charlie     Paris

Now that we have performed an inner join in Python, let’s explore some more advanced joins.

Outer Join

An outer join returns all rows from both tables, including the rows where the key values are not present in both tables.

outer_merged = pd.merge(orders, customers, on='CustomerID', how='outer', suffixes=('_order', '_customer'))

Left Join

A left join returns all rows from the left table and the matching rows from the right table. If there are no matching rows in the right table, the result will contain null values.

left_merged = pd.merge(orders, customers, on='CustomerID', how='left', suffixes=('_order', '_customer'))

Right Join

A right join returns all rows from the right table and the matching rows from the left table. If there are no matching rows in the left table, the result will contain null values.

right_merged = pd.merge(orders, customers, on='CustomerID', how='right', suffixes=('_order', '_customer'))

Conclusion

In this article, we discussed how to perform inner joins in Python. We used the Pandas library to merge two tables based on a related column. Inner joins only include rows that have matching values in both tables. We also explored some more advanced joins, such as outer, left, and right joins. By understanding how to join tables in Python, you can combine data from multiple sources to create a complete dataset for your projects.

Leave a Comment

Your email address will not be published. Required fields are marked *