w3resource

Compare column summation using for loop vs. sum method in Pandas


Pandas: Performance Optimization Exercise-1 with Solution


Write a Pandas program to create a large DataFrame and measure the time taken to sum a column using a for loop vs. using the sum method.

Sample Solution:

Python Code:

import pandas as pd  # Import the Pandas library
import numpy as np  # Import the NumPy library
import time  # Import the time module to measure execution time

# Create a large DataFrame with random integers
np.random.seed(0)  # Set seed for reproducibility
data = np.random.randint(1, 100, size=(1000000, 1))  # Generate random data
df = pd.DataFrame(data, columns=['Values'])  # Create a DataFrame

# Measure the time taken to sum the column using a for loop
start_time = time.time()  # Record the start time
sum_for_loop = 0  # Initialize the sum variable
for value in df['Values']:  # Iterate through each value in the column
    sum_for_loop += value  # Add the value to the sum variable
time_for_loop = time.time() - start_time  # Calculate the time taken

# Measure the time taken to sum the column using the sum method
start_time = time.time()  # Record the start time
sum_method = df['Values'].sum()  # Use the sum method to calculate the sum
time_sum_method = time.time() - start_time  # Calculate the time taken

# Print the results
print("Sum using for loop:", sum_for_loop)
print("Time taken using for loop:", time_for_loop, "seconds")
print("Sum using sum method:", sum_method)
print("Time taken using sum method:", time_sum_method, "seconds")

Output:

Sum using for loop: 49988718
Time taken using for loop: 0.1499619483947754 seconds
Sum using sum method: 49988718
Time taken using sum method: 0.0010004043579101562 seconds
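The timings above come from a single run measured with time.time(), so they will differ between machines and even between runs. As a rough sketch of a steadier measurement, the example below repeats each approach a few times with the standard-library timeit module and keeps the best run. The helper names sum_with_loop, sum_with_method and best_time are hypothetical and are not part of the original solution.

```python
import timeit

import numpy as np
import pandas as pd

# Rebuild the same DataFrame as in the solution above
np.random.seed(0)
data = np.random.randint(1, 100, size=(1000000, 1))
df = pd.DataFrame(data, columns=['Values'])


def sum_with_loop(series):
    """Sum a Series by iterating over it in Python (hypothetical helper)."""
    total = 0
    for value in series:
        total += value
    return total


def sum_with_method(series):
    """Sum a Series with the built-in Pandas sum method (hypothetical helper)."""
    return series.sum()


def best_time(func, series, repeat=3):
    """Run func(series) `repeat` times and return the fastest time in seconds."""
    return min(timeit.repeat(lambda: func(series), number=1, repeat=repeat))


loop_time = best_time(sum_with_loop, df['Values'])
method_time = best_time(sum_with_method, df['Values'])

print("Best time using for loop:", loop_time, "seconds")
print("Best time using sum method:", method_time, "seconds")
print("Approximate speedup:", loop_time / method_time)
```

Taking the minimum of several runs reduces the influence of background activity on the machine. The exact speedup still depends on the hardware and on the installed Pandas and NumPy versions.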

Explanation:

  • Import Libraries:
    • Import the Pandas library for data manipulation.
    • Import the NumPy library for generating random data.
    • Import the time module to measure execution time.
  • Create a Large DataFrame:
    • Set a seed for reproducibility using np.random.seed(0).
    • Generate random integers from 1 to 99 with np.random.randint (the upper bound of 100 is exclusive) and create a large DataFrame with 1,000,000 rows and one column named 'Values'.
  • Measure Time Using a For Loop:
    • Record the start time using time.time().
    • Initialize a variable sum_for_loop to store the sum.
    • Iterate through each value in the 'Values' column using a for loop and add it to sum_for_loop.
    • Calculate the time taken by subtracting the start time from the current time.
  • Measure Time Using Sum Method:
    • Record the start time using time.time().
    • Use the Pandas sum method to calculate the sum of the 'Values' column.
    • Calculate the time taken by subtracting the start time from the current time.
  • Finally, display the sum and the time taken for both the for loop and the sum method.
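The gap between the two timings comes from where the work happens: the for loop adds the values one at a time in the Python interpreter, while the sum method performs the reduction over the whole column in compiled code. As an illustrative sketch, the example below checks that a few alternative ways of summing the same column agree on the result. Python's built-in sum() and the NumPy route via to_numpy() are additions for comparison and are not part of the original solution.

```python
import numpy as np
import pandas as pd

# Rebuild the same DataFrame as in the solution above
np.random.seed(0)
data = np.random.randint(1, 100, size=(1000000, 1))
df = pd.DataFrame(data, columns=['Values'])
values = df['Values']

# Python's built-in sum(): still iterates element by element in the interpreter
total_builtin = sum(values)

# NumPy reduction on the underlying array: runs in compiled code
total_numpy = values.to_numpy().sum()

# Pandas sum method, as used in the original solution
total_pandas = values.sum()

print("Built-in sum():", total_builtin)
print("NumPy sum:", total_numpy)
print("Pandas sum method:", total_pandas)
print("All results agree:", total_builtin == total_numpy == total_pandas)
```

All three approaches return the same total, so the choice between them is a matter of speed and readability rather than correctness.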
