Compare performance of apply vs. vectorized operations in Pandas
Pandas: Performance Optimization Exercise-2 with Solution
Write a Pandas program to compare the performance of applying a custom function to a column using apply vs. using vectorized operations.
Sample Solution:
Python Code:
import pandas as pd # Import the Pandas library
import numpy as np # Import the NumPy library
import time # Import the time module to measure execution time
# Create a large DataFrame with random integers
np.random.seed(0) # Set seed for reproducibility
data = np.random.randint(1, 100, size=(1000000, 1)) # Generate random data
df = pd.DataFrame(data, columns=['Values']) # Create a DataFrame
# Define a custom function to apply
def custom_function(x):
    return x * 2 + 3
# Measure the time taken to apply the custom function using apply
start_time = time.time() # Record the start time
df['Apply_Result'] = df['Values'].apply(custom_function) # Apply the custom function using apply
time_apply = time.time() - start_time # Calculate the time taken
# Measure the time taken to apply the custom function using vectorized operations
start_time = time.time() # Record the start time
df['Vectorized_Result'] = custom_function(df['Values']) # Apply the custom function using vectorized operations
time_vectorized = time.time() - start_time # Calculate the time taken
# Print the time taken for both methods
print("Time taken using apply:", time_apply, "seconds")
print("Time taken using vectorized operations:", time_vectorized, "seconds")
Output:
Time taken using apply: 0.25844264030456543 seconds
Time taken using vectorized operations: 0.0029630661010742188 seconds
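A single time.time() measurement can be noisy, so the exact figures will vary between machines and runs. As a hedged sketch (not part of the original solution), the same comparison can be repeated several times with the standard-library timeit module and the best run of each method reported. The smaller 100,000-row DataFrame, the repeat count of 5, and the variable names below are assumptions chosen only to keep the example quick to run.

```python
import timeit

import numpy as np
import pandas as pd


def custom_function(x):
    # Same operation as in the exercise
    return x * 2 + 3


# Assumption: a smaller DataFrame than in the exercise, to keep the run short
rng = np.random.default_rng(0)
df = pd.DataFrame({'Values': rng.integers(1, 100, size=100_000)})

# Run each approach 5 times, one execution per measurement
apply_times = timeit.repeat(
    lambda: df['Values'].apply(custom_function), number=1, repeat=5
)
vectorized_times = timeit.repeat(
    lambda: custom_function(df['Values']), number=1, repeat=5
)

# The minimum is the measurement least disturbed by other system activity
best_apply = min(apply_times)
best_vectorized = min(vectorized_times)

print("Best time using apply:", best_apply, "seconds")
print("Best time using vectorized operations:", best_vectorized, "seconds")
```

Reporting the minimum of several runs is a common convention for micro-benchmarks; the absolute numbers still depend on the hardware and on the installed Pandas and NumPy versions.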
Explanation:
- Import Libraries:
  - Import the Pandas library for data manipulation.
  - Import the NumPy library for generating random data.
  - Import the time module to measure execution time.
- Create a Large DataFrame:
  - Set a seed for reproducibility using np.random.seed(0).
  - Generate random integers with np.random.randint and create a large DataFrame with 1,000,000 rows and one column named 'Values'.
- Define a Custom Function:
  - Create a custom function custom_function(x) that performs a simple operation on the input x (e.g., x * 2 + 3).
- Measure Time Using apply:
  - Record the start time using time.time().
  - Apply the custom function to the 'Values' column using the Pandas apply method and store the result in a new column 'Apply_Result'.
  - Calculate the time taken by subtracting the start time from the current time.
- Measure Time Using Vectorized Operations:
  - Record the start time using time.time().
  - Apply the custom function to the 'Values' column using vectorized operations and store the result in a new column 'Vectorized_Result'.
  - Calculate the time taken by subtracting the start time from the current time.
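- Finally, print the time taken by both the apply method and the vectorized operations. In the sample output above, the vectorized version is roughly 87 times faster (about 0.258 seconds vs. 0.003 seconds), because apply calls the Python function once per element, while the vectorized expression operates on the whole column at once in compiled NumPy code.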
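The steps above compare only speed. As a minimal sketch (an addition for illustration, not part of the original solution), the example below checks that both approaches produce the same values and counts how often the function is invoked in each case. The call counter, the wrapper name counting_function, and the tiny 10-row column are hypothetical helpers introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical counter used only to illustrate how often the function runs
calls = {'n': 0}


def counting_function(x):
    # Same operation as custom_function in the exercise, plus a call counter
    calls['n'] += 1
    return x * 2 + 3


# Assumption: a tiny column (values 1 to 10) so the results are easy to inspect
df = pd.DataFrame({'Values': np.arange(1, 11, dtype='int64')})

# apply: the function receives individual elements of the column
calls['n'] = 0
apply_result = df['Values'].apply(counting_function)
apply_calls = calls['n']

# Vectorized: the function receives the whole Series in one call
calls['n'] = 0
vectorized_result = counting_function(df['Values'])
vectorized_calls = calls['n']

# Both approaches should produce the same values
same = bool((apply_result == vectorized_result).all())

print("Function calls with apply:", apply_calls)
print("Function calls with vectorized operations:", vectorized_calls)
print("Results identical:", same)
```

This works because custom_function uses only arithmetic operators, which Pandas applies to an entire Series at once. A function that contains Python-level branching on individual values (for example an if statement on x) would generally need to be rewritten, for instance with np.where, before it could be called on a whole column.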