Compare DataFrame merge using merge method vs. nested for loop in Pandas
7. Merge Operation: merge() vs. Nested For Loop
Write a Pandas program that performs a merge operation on two large DataFrames using the "merge" method. It compares the performance with a nested for loop.
Sample Solution :
Python Code :
import pandas as pd  # Import the Pandas library
import numpy as np  # Import the NumPy library
import time  # Import the time module to measure execution time
# Create two large DataFrames
np.random.seed(0)  # Set seed for reproducibility
data1 = {
    'Key': np.random.randint(1, 1000, size=1000),
    'Value1': np.random.randint(1, 100, size=1000)
}
data2 = {
    'Key': np.random.randint(1, 1000, size=1000),
    'Value2': np.random.randint(1, 100, size=1000)
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Perform merge using the merge method
start_time = time.time()  # Record the start time
merged_df = pd.merge(df1, df2, on='Key')
time_merge = time.time() - start_time  # Calculate the time taken
# Perform merge using a nested for loop
start_time = time.time()  # Record the start time
merged_data = []
for index1, row1 in df1.iterrows():
    for index2, row2 in df2.iterrows():
        if row1['Key'] == row2['Key']:
            merged_data.append({**row1, **row2})
merged_df_loop = pd.DataFrame(merged_data)
time_nested_loop = time.time() - start_time  # Calculate the time taken
# Print the time taken for both methods
print("Time taken using merge method:", time_merge, "seconds")
print("Time taken using nested for loop:", time_nested_loop, "seconds") 
Output:
Time taken using merge method: 0.0009975433349609375 seconds Time taken using nested for loop: 31.308058738708496 seconds
Explanation:
- Import Libraries:
- Import the Pandas library for data manipulation.
- Import the NumPy library for generating random data.
- Import the time module to measure execution time.
- Create Two Large DataFrames:
- Set a seed for reproducibility using np.random.seed(0).
- Create two dictionaries data1 and data2 with a 'Key' column containing random integers and a 'Value1' or 'Value2' column containing random integers.
- Generate DataFrames 'df1' and 'df2' using the dictionaries.
- Perform Merge Using "merge" Method:
- Record the start time using time.time().
- Use the "pd.merge" method to merge 'df1' and 'df2' on the 'Key' column.
- Calculate the time taken by subtracting the start time from the current time.
- Perform Merge Using Nested For Loop:
- Record the start time using time.time().
- Initialize an empty list merged_data to store the merged rows.
- Use nested for loops to iterate through each row in 'df1' and 'df2'. If the 'Key' values match, combine the rows and append to merged_data.
- Convert 'merged_data' to a DataFrame 'merged_df_loop'.
- Calculate the time taken by subtracting the start time from the current time.
- Print Results:
- Display the time taken for both the merge method and the nested for loop method.
For more Practice: Solve these Related Problems:
- Write a Pandas program to merge two large DataFrames using the merge() method and measure the runtime.
- Write a Pandas program to simulate a merge using nested for loops and compare its performance with merge().
- Write a Pandas program to benchmark the efficiency of the merge() function against a manual join implemented with loops.
- Write a Pandas program to compare and report the execution time differences between merging DataFrames with merge() and iterative concatenation.
Go to:
PREV : GroupBy Aggregation vs. Manual Iteration.
NEXT :
 Optimize Memory with Categorical Data.
Python-Pandas Code Editor:
Have another way to solve this solution? Contribute your code (and comments) through Disqus.
What is the difficulty level of this exercise?
Test your Programming skills with w3resource's quiz.
