w3resource

Compare DataFrame row filtering using for loop vs. Boolean indexing


Pandas: Performance Optimization Exercise-5 with Solution


Write a Pandas program to filter rows of a DataFrame based on a condition using a for loop vs. using boolean indexing. Compare performance.

Sample Solution :

Python Code :

import pandas as pd  # Import the Pandas library
import numpy as np  # Import the NumPy library
import time  # Import the time module to measure execution time

# Create a sample DataFrame
np.random.seed(0)  # Set seed for reproducibility
data = {
    'A': np.random.randint(1, 100, size=100000),
    'B': np.random.randint(1, 100, size=100000)
}
df = pd.DataFrame(data)

# Define the condition
condition = 50

# Filter rows using a for loop
start_time = time.time()  # Record the start time
filtered_rows_loop = []
for index, row in df.iterrows():
    if row['A'] > condition:
        filtered_rows_loop.append(row)
filtered_df_loop = pd.DataFrame(filtered_rows_loop)
time_for_loop = time.time() - start_time  # Calculate the time taken

# Filter rows using boolean indexing
start_time = time.time()  # Record the start time
filtered_df_bool = df[df['A'] > condition]
time_boolean_indexing = time.time() - start_time  # Calculate the time taken

# Print the time taken for both methods
print("Time taken using for loop:", time_for_loop, "seconds")
print("Time taken using boolean indexing:", time_boolean_indexing, "seconds")

Output:

Time taken using for loop: 3.864267349243164 seconds
Time taken using boolean indexing: 0.0010325908660888672 seconds

Explanation:

  • Import libraries:
    • Import the Pandas library for data manipulation.
    • Import the NumPy library for generating random data.
    • Import the time module to measure execution time.
  • Create a Sample DataFrame:
    • Set a seed for reproducibility using np.random.seed(0).
    • Create a dictionary data with columns 'A' and 'B' containing random integers.
    • Generate a DataFrame "df" using the dictionary.
  • Define the condition:
    • Set a condition value (e.g., condition = 50) to filter rows where the value in column 'A' is greater than this condition.
  • Filter Rows Using a For Loop:
    • Record the start time using time.time().
    • Iterate through each row in the DataFrame using a for loop with df.iterrows().
    • Append rows that meet the condition to a list filtered_rows_loop.
    • Convert the list to a DataFrame "filtered_df_loop".
    • Calculate the time taken by subtracting the start time from the current time.
  • Filter Rows Using Boolean Indexing:
    • Record the start time using time.time().
    • Use boolean indexing to filter rows where the value in column 'A' is greater than the condition.
    • Store the result in a DataFrame "filtered_df_bool".
    • Calculate the time taken by subtracting the start time from the current time.
  • Print Results:
    • Display the time taken for both the for loop method and the boolean indexing method.

Python-Pandas Code Editor:

Have another way to solve this solution? Contribute your code (and comments) through Disqus.

Previous: Reduce memory usage in Pandas DataFrame using astype method.
Next: Compare data aggregation using groupby vs. manual iteration in Pandas.

What is the difficulty level of this exercise?

Test your Programming skills with w3resource's quiz.



Follow us on Facebook and Twitter for latest update.