Compare column summation using for loop vs. sum method in Pandas
Pandas: Performance Optimization Exercise-1 with Solution
Write a Pandas program to create a large DataFrame and measure the time taken to sum a column using a for loop vs. using the sum method.
Sample Solution:
Python Code:
import pandas as pd # Import the Pandas library
import numpy as np # Import the NumPy library
import time # Import the time module to measure execution time
# Create a large DataFrame with random integers
np.random.seed(0) # Set seed for reproducibility
data = np.random.randint(1, 100, size=(1000000, 1)) # Generate random data
df = pd.DataFrame(data, columns=['Values']) # Create a DataFrame
# Measure the time taken to sum the column using a for loop
start_time = time.time() # Record the start time
sum_for_loop = 0 # Initialize the sum variable
for value in df['Values']: # Iterate through each value in the column
    sum_for_loop += value # Add the value to the sum variable
time_for_loop = time.time() - start_time # Calculate the time taken
# Measure the time taken to sum the column using the sum method
start_time = time.time() # Record the start time
sum_method = df['Values'].sum() # Use the sum method to calculate the sum
time_sum_method = time.time() - start_time # Calculate the time taken
# Print the results
print("Sum using for loop:", sum_for_loop)
print("Time taken using for loop:", time_for_loop, "seconds")
print("Sum using sum method:", sum_method)
print("Time taken using sum method:", time_sum_method, "seconds")
Output:
Sum using for loop: 49988718
Time taken using for loop: 0.1499619483947754 seconds
Sum using sum method: 49988718
Time taken using sum method: 0.0010004043579101562 seconds
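To put the two timings in perspective, the short sketch below divides one by the other. The numbers are copied from the sample output above; they come from a single run on one machine, so the ratio is only indicative and will differ on other systems.

```python
# Timings copied from the sample output above (in seconds).
time_for_loop = 0.1499619483947754
time_sum_method = 0.0010004043579101562

# Ratio of the two timings: how many times faster sum() was in this run.
speedup = time_for_loop / time_sum_method
print(f"sum() was about {speedup:.0f}x faster in this run")
```

Because the sum method finishes in about a millisecond, a single measurement that short is sensitive to timer resolution and system noise, so treat the ratio as a rough order of magnitude.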
Explanation:
- Import Libraries:
  - Import the Pandas library for data manipulation.
  - Import the NumPy library for generating random data.
  - Import the time module to measure execution time.
- Create a Large DataFrame:
  - Set a seed for reproducibility using np.random.seed(0).
  - Generate random integers with np.random.randint and create a large DataFrame with 1,000,000 rows and one column named 'Values'.
- Measure Time Using a For Loop:
  - Record the start time using time.time().
  - Initialize a variable sum_for_loop to store the sum.
  - Iterate through each value in the 'Values' column using a for loop and add it to sum_for_loop.
  - Calculate the time taken by subtracting the start time from the current time.
- Measure Time Using Sum Method:
  - Record the start time using time.time().
  - Use the Pandas sum method to calculate the sum of the 'Values' column.
  - Calculate the time taken by subtracting the start time from the current time.
- Finally, display the sum and the time taken for both the for loop and the sum method. Both approaches return the same total, but the sum method is much faster because it runs as a vectorized operation in compiled code, whereas the for loop performs one Python-level addition per element.
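The steps above time each approach once with time.time(). As a hedged alternative, the sketch below repeats each measurement with the standard-library timeit module and keeps the fastest run, which tends to be less affected by background noise. The helper names sum_with_loop and best_time are hypothetical, introduced only for this illustration, and the DataFrame is reduced to 100,000 rows (an assumption made to keep the repeated runs short).

```python
import timeit

import numpy as np
import pandas as pd

# Smaller DataFrame than in the exercise (assumption: 100,000 rows keeps repeats quick).
np.random.seed(0)
df = pd.DataFrame(np.random.randint(1, 100, size=(100_000, 1)), columns=["Values"])


def sum_with_loop(series):
    """Hypothetical helper: add the values one at a time in a Python loop."""
    total = 0
    for value in series:
        total += value
    return total


def best_time(func, repeat=5):
    """Hypothetical helper: fastest of `repeat` single-call timings, in seconds."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


loop_seconds = best_time(lambda: sum_with_loop(df["Values"]))
method_seconds = best_time(lambda: df["Values"].sum())

print("Sums match:", sum_with_loop(df["Values"]) == df["Values"].sum())
print(f"for loop:   {loop_seconds:.6f} s")
print(f"sum method: {method_seconds:.6f} s")
```

Taking the minimum of several runs is one common convention for micro-benchmarks; reporting the mean or median instead would be an equally reasonable choice.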