Compare data aggregation using groupby vs. manual iteration in Pandas
Pandas: Performance Optimization Exercise-6 with Solution
Write a Pandas program that uses the groupby method to aggregate data and compares performance with manually iterating through the DataFrame.
Sample Solution :
Python Code :
import pandas as pd # Import the Pandas library
import numpy as np # Import the NumPy library
import time # Import the time module to measure execution time
# Create a sample DataFrame
np.random.seed(0) # Set seed for reproducibility
data = {
'Category': np.random.choice(['A', 'B', 'C', 'D'], size=1000000),
'Values': np.random.randint(1, 100, size=1000000)
}
df = pd.DataFrame(data)
# Define a custom aggregation function
def custom_aggregation(data):
result = {}
for category in data['Category'].unique():
result[category] = data[data['Category'] == category]['Values'].sum()
return result
# Aggregate data using the groupby method
start_time = time.time() # Record the start time
groupby_result = df.groupby('Category')['Values'].sum()
time_groupby = time.time() - start_time # Calculate the time taken
# Aggregate data using manual iteration
start_time = time.time() # Record the start time
manual_result = custom_aggregation(df)
time_manual = time.time() - start_time # Calculate the time taken
# Print the results
print("Aggregation result using groupby:")
print(groupby_result)
print("\nTime taken using groupby:", time_groupby, "seconds")
print("\nAggregation result using manual iteration:")
print(manual_result)
print("\nTime taken using manual iteration:", time_manual, "seconds")
Output:
Aggregation result using groupby: Category A 12541392 B 12440541 C 12477135 D 12502875 Name: Values, dtype: int32 Time taken using groupby: 0.0588076114654541 seconds Aggregation result using manual iteration: {'A': 12541392, 'D': 12502875, 'B': 12440541, 'C': 12477135} Time taken using manual iteration: 0.2567422389984131 seconds
Explanation:
- Import Libraries:
- Import the Pandas library for data manipulation.
- Import the NumPy library for generating random data.
- Import the time module to measure execution time.
- Create a Sample DataFrame:
- Set a seed for reproducibility using np.random.seed(0).
- Create a dictionary data with a 'Category' column containing random category labels and a 'Values' column containing random integers.
- Generate a DataFrame df using the dictionary.
- Define Custom Aggregation Function:
- Create a function custom_aggregation(data) to manually iterate through the DataFrame and aggregate the 'Values' column based on the 'Category' column.
- Aggregate Data Using groupby:
- Record the start time using time.time().
- Use the groupby method to aggregate the 'Values' column by 'Category' and calculate the sum.
- Calculate the time taken by subtracting the start time from the current time.
- Aggregate Data Using Manual Iteration:
- Record the start time using time.time().
- Use the custom_aggregation function to manually iterate through the DataFrame and aggregate the 'Values' column based on the 'Category' column.
- Calculate the time taken by subtracting the start time from the current time.
- Finally display the aggregation results and the time taken for both the groupby method and the manual iteration method.
Python-Pandas Code Editor:
Have another way to solve this solution? Contribute your code (and comments) through Disqus.
Previous: Compare DataFrame row filtering using for loop vs. Boolean indexing.
Next: Compare DataFrame merge using merge method vs. nested for loop in Pandas.
What is the difficulty level of this exercise?
Test your Programming skills with w3resource's quiz.
- Weekly Trends and Language Statistics
- Weekly Trends and Language Statistics