Compare DataFrame row filtering using for loop vs. Boolean indexing

Last update on May 05 2025 13:03:55 (UTC/GMT +8 hours)

5. Row Filtering: For Loop vs. Boolean Indexing

Write a Pandas program to filter rows of a DataFrame based on a condition using a for loop vs. using boolean indexing. Compare performance.

Sample Solution :

Python Code :

import pandas as pd  # Import the Pandas library
import numpy as np  # Import the NumPy library
import time  # Import the time module to measure execution time

# Create a sample DataFrame
np.random.seed(0)  # Set seed for reproducibility
data = {
    'A': np.random.randint(1, 100, size=100000),
    'B': np.random.randint(1, 100, size=100000)
}
df = pd.DataFrame(data)

# Define the condition
condition = 50

# Filter rows using a for loop
start_time = time.time()  # Record the start time
filtered_rows_loop = []
for index, row in df.iterrows():
    if row['A'] > condition:
        filtered_rows_loop.append(row)
filtered_df_loop = pd.DataFrame(filtered_rows_loop)
time_for_loop = time.time() - start_time  # Calculate the time taken

# Filter rows using boolean indexing
start_time = time.time()  # Record the start time
filtered_df_bool = df[df['A'] > condition]
time_boolean_indexing = time.time() - start_time  # Calculate the time taken

# Print the time taken for both methods
print("Time taken using for loop:", time_for_loop, "seconds")
print("Time taken using boolean indexing:", time_boolean_indexing, "seconds")

Output:

Time taken using for loop: 3.864267349243164 seconds
Time taken using boolean indexing: 0.0010325908660888672 seconds

Explanation:

Import libraries:

Import the Pandas library for data manipulation.
Import the NumPy library for generating random data.
Import the time module to measure execution time.

Create a Sample DataFrame:

Set a seed for reproducibility using np.random.seed(0).
Create a dictionary data with columns 'A' and 'B' containing random integers.
Generate a DataFrame "df" using the dictionary.

Define the condition:

Set a condition value (e.g., condition = 50) to filter rows where the value in column 'A' is greater than this condition.

Filter Rows Using a For Loop:

Record the start time using time.time().
Iterate through each row in the DataFrame using a for loop with df.iterrows().
Append rows that meet the condition to a list filtered_rows_loop.
Convert the list to a DataFrame "filtered_df_loop".
Calculate the time taken by subtracting the start time from the current time.

Filter Rows Using Boolean Indexing:

Record the start time using time.time().
Use boolean indexing to filter rows where the value in column 'A' is greater than the condition.
Store the result in a DataFrame "filtered_df_bool".
Calculate the time taken by subtracting the start time from the current time.

Print Results:

Display the time taken for both the for loop method and the boolean indexing method.

For more Practice: Solve these Related Problems:

Write a Pandas program to filter rows based on a condition using a for loop and measure the execution time.
Write a Pandas program to filter rows using boolean indexing and compare its performance against a manual loop.
Write a Pandas program to implement both filtering methods on a large DataFrame and report the time difference.
Write a Pandas program to create a performance benchmark that compares row filtering with iterative loops versus vectorized boolean indexing.

Go to:

Previous: Reduce memory usage in Pandas DataFrame using astype method.
Next: Compare data aggregation using groupby vs. manual iteration in Pandas.

Python-Pandas Code Editor:

Have another way to solve this solution? Contribute your code (and comments) through Disqus.