Compare DataFrame row filtering using for loop vs. Boolean indexing
Pandas: Performance Optimization Exercise-5 with Solution
Write a Pandas program to filter rows of a DataFrame based on a condition using a for loop vs. using boolean indexing. Compare performance.
Sample Solution :
Python Code :
import pandas as pd # Import the Pandas library
import numpy as np # Import the NumPy library
import time # Import the time module to measure execution time
# Create a sample DataFrame
np.random.seed(0) # Set seed for reproducibility
data = {
'A': np.random.randint(1, 100, size=100000),
'B': np.random.randint(1, 100, size=100000)
}
df = pd.DataFrame(data)
# Define the condition
condition = 50
# Filter rows using a for loop
start_time = time.time() # Record the start time
filtered_rows_loop = []
for index, row in df.iterrows():
if row['A'] > condition:
filtered_rows_loop.append(row)
filtered_df_loop = pd.DataFrame(filtered_rows_loop)
time_for_loop = time.time() - start_time # Calculate the time taken
# Filter rows using boolean indexing
start_time = time.time() # Record the start time
filtered_df_bool = df[df['A'] > condition]
time_boolean_indexing = time.time() - start_time # Calculate the time taken
# Print the time taken for both methods
print("Time taken using for loop:", time_for_loop, "seconds")
print("Time taken using boolean indexing:", time_boolean_indexing, "seconds")
Output:
Time taken using for loop: 3.864267349243164 seconds Time taken using boolean indexing: 0.0010325908660888672 seconds
Explanation:
- Import libraries:
- Import the Pandas library for data manipulation.
- Import the NumPy library for generating random data.
- Import the time module to measure execution time.
- Create a Sample DataFrame:
- Set a seed for reproducibility using np.random.seed(0).
- Create a dictionary data with columns 'A' and 'B' containing random integers.
- Generate a DataFrame "df" using the dictionary.
- Define the condition:
- Set a condition value (e.g., condition = 50) to filter rows where the value in column 'A' is greater than this condition.
- Filter Rows Using a For Loop:
- Record the start time using time.time().
- Iterate through each row in the DataFrame using a for loop with df.iterrows().
- Append rows that meet the condition to a list filtered_rows_loop.
- Convert the list to a DataFrame "filtered_df_loop".
- Calculate the time taken by subtracting the start time from the current time.
- Filter Rows Using Boolean Indexing:
- Record the start time using time.time().
- Use boolean indexing to filter rows where the value in column 'A' is greater than the condition.
- Store the result in a DataFrame "filtered_df_bool".
- Calculate the time taken by subtracting the start time from the current time.
- Print Results:
- Display the time taken for both the for loop method and the boolean indexing method.
Python-Pandas Code Editor:
Have another way to solve this solution? Contribute your code (and comments) through Disqus.
Previous: Reduce memory usage in Pandas DataFrame using astype method.
Next: Compare data aggregation using groupby vs. manual iteration in Pandas.
What is the difficulty level of this exercise?
Test your Programming skills with w3resource's quiz.
- Weekly Trends and Language Statistics
- Weekly Trends and Language Statistics