Compare DataFrame merge using merge method vs. nested for loop in Pandas
Pandas: Performance Optimization Exercise-7 with Solution
Write a Pandas program that performs a merge operation on two large DataFrames using the "merge" method. It compares the performance with a nested for loop.
Sample Solution :
Python Code :
import pandas as pd # Import the Pandas library
import numpy as np # Import the NumPy library
import time # Import the time module to measure execution time
# Create two large DataFrames
np.random.seed(0) # Set seed for reproducibility
data1 = {
'Key': np.random.randint(1, 1000, size=1000),
'Value1': np.random.randint(1, 100, size=1000)
}
data2 = {
'Key': np.random.randint(1, 1000, size=1000),
'Value2': np.random.randint(1, 100, size=1000)
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Perform merge using the merge method
start_time = time.time() # Record the start time
merged_df = pd.merge(df1, df2, on='Key')
time_merge = time.time() - start_time # Calculate the time taken
# Perform merge using a nested for loop
start_time = time.time() # Record the start time
merged_data = []
for index1, row1 in df1.iterrows():
for index2, row2 in df2.iterrows():
if row1['Key'] == row2['Key']:
merged_data.append({**row1, **row2})
merged_df_loop = pd.DataFrame(merged_data)
time_nested_loop = time.time() - start_time # Calculate the time taken
# Print the time taken for both methods
print("Time taken using merge method:", time_merge, "seconds")
print("Time taken using nested for loop:", time_nested_loop, "seconds")
Output:
Time taken using merge method: 0.0009975433349609375 seconds Time taken using nested for loop: 31.308058738708496 seconds
Explanation:
- Import Libraries:
- Import the Pandas library for data manipulation.
- Import the NumPy library for generating random data.
- Import the time module to measure execution time.
- Create Two Large DataFrames:
- Set a seed for reproducibility using np.random.seed(0).
- Create two dictionaries data1 and data2 with a 'Key' column containing random integers and a 'Value1' or 'Value2' column containing random integers.
- Generate DataFrames 'df1' and 'df2' using the dictionaries.
- Perform Merge Using "merge" Method:
- Record the start time using time.time().
- Use the "pd.merge" method to merge 'df1' and 'df2' on the 'Key' column.
- Calculate the time taken by subtracting the start time from the current time.
- Perform Merge Using Nested For Loop:
- Record the start time using time.time().
- Initialize an empty list merged_data to store the merged rows.
- Use nested for loops to iterate through each row in 'df1' and 'df2'. If the 'Key' values match, combine the rows and append to merged_data.
- Convert 'merged_data' to a DataFrame 'merged_df_loop'.
- Calculate the time taken by subtracting the start time from the current time.
- Print Results:
- Display the time taken for both the merge method and the nested for loop method.
Python-Pandas Code Editor:
Have another way to solve this solution? Contribute your code (and comments) through Disqus.
Previous: Compare data aggregation using groupby vs. manual iteration in Pandas.
Next: Optimize memory usage with Categorical data type in Pandas DataFrame.
What is the difficulty level of this exercise?
Test your Programming skills with w3resource's quiz.
- Weekly Trends and Language Statistics
- Weekly Trends and Language Statistics