Optimize memory usage with Categorical data type in Pandas DataFrame
Pandas: Performance Optimization Exercise-8 with Solution
Write a Pandas program to create a DataFrame with categorical data and use the category data type to optimize memory usage. Measure the performance difference.
Sample Solution:
Python Code:
import pandas as pd # Import the Pandas library
import numpy as np # Import the NumPy library
# Create a sample DataFrame with categorical data
np.random.seed(0) # Set seed for reproducibility
data = {
'Category': np.random.choice(['A', 'B', 'C', 'D'], size=1000000),
'Values': np.random.randint(1, 100, size=1000000)
}
df = pd.DataFrame(data)
# Show memory usage before optimization (df.info() prints its report itself and returns None)
print("Memory usage before optimization:")
df.info(memory_usage='deep')
# Convert the 'Category' column to the category data type
df['Category'] = df['Category'].astype('category')
# Show memory usage after optimization
print("\nMemory usage after optimization:")
df.info(memory_usage='deep')
Output:
Memory usage before optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   Category  1000000 non-null  object
 1   Values    1000000 non-null  int32
dtypes: int32(1), object(1)
memory usage: 59.1 MB

Memory usage after optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   Category  1000000 non-null  category
 1   Values    1000000 non-null  int32
dtypes: category(1), int32(1)
memory usage: 4.8 MB
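To see where the savings come from, it can help to measure the 'Category' column on its own and look at how the category dtype stores it. The following is a minimal sketch, not part of the sample solution: the smaller row count, the `default_rng` generator, and the variable names are assumptions chosen for illustration, and the column is forced to `dtype=object` so that the comparison matches the object column shown in the output above.

```python
import numpy as np
import pandas as pd

# Assumptions for this sketch: a smaller row count and a fresh random generator.
rng = np.random.default_rng(0)
n_rows = 100_000

# Build the label column explicitly as object dtype (one Python string per row).
labels = pd.Series(rng.choice(['A', 'B', 'C', 'D'], size=n_rows),
                   dtype=object, name='Category')

# Convert to the category dtype: distinct labels are stored once,
# and each row holds a small integer code pointing at its label.
as_category = labels.astype('category')

# deep=True counts the real size of the Python strings, not just the pointers.
object_bytes = labels.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)

print(f"object dtype:   {object_bytes:,} bytes")
print(f"category dtype: {category_bytes:,} bytes")
print(f"reduction:      {object_bytes / category_bytes:.1f}x")

# Inspect the internal representation of the categorical column.
print("categories:", list(as_category.cat.categories))
print("codes dtype:", as_category.cat.codes.dtype)
print("first codes:", as_category.cat.codes.head().tolist())
```

With only four distinct labels, the codes fit in an 8-bit integer, so each row costs one byte instead of a pointer plus a full Python string object. The saving shrinks as the number of distinct values approaches the number of rows.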
Explanation:
- Import Libraries:
- Import the Pandas library for data manipulation.
- Import the NumPy library for generating random data.
- Create a Sample DataFrame with Categorical Data:
- Set a seed for reproducibility using np.random.seed(0).
- Create a dictionary data with a 'Category' column containing random category labels and a 'Values' column containing random integers.
- Generate a DataFrame df using the dictionary.
- Print Memory Usage Before Optimization:
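- Use df.info(memory_usage='deep') to display the memory usage of the DataFrame before optimization. The 'deep' setting makes Pandas count the actual size of the Python strings in object columns rather than only the pointers to them.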
- Convert Column to Category Data Type:
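- Use the astype method to convert the 'Category' column to the category data type. Each distinct label is then stored once, and every row holds only a small integer code that refers to it.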
- Print Memory Usage After Optimization:
- Use df.info(memory_usage='deep') to display the memory usage of the DataFrame after optimization.
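The exercise also asks to measure the performance difference, which the steps above cover only for memory. One possible way to compare speed is to time the same group-wise sum on an object column and on its categorical copy. This is a rough sketch under stated assumptions: the helper `time_groupby`, the row count, and the choice of `groupby(...).sum()` as the benchmark are illustrative, and timings vary by machine and Pandas version, so no particular speedup is implied.

```python
import time

import numpy as np
import pandas as pd

np.random.seed(0)  # Same seeding style as the sample solution
n_rows = 200_000   # Assumption: smaller than the solution's 1,000,000 rows

# Frame with an object-dtype label column (forced explicitly for the comparison).
df_obj = pd.DataFrame({
    'Category': pd.Series(np.random.choice(['A', 'B', 'C', 'D'], size=n_rows),
                          dtype=object),
    'Values': np.random.randint(1, 100, size=n_rows),
})

# Identical data, but with the label column stored as category.
df_cat = df_obj.copy()
df_cat['Category'] = df_cat['Category'].astype('category')


def time_groupby(frame, repeats=5):
    """Hypothetical helper: best wall-clock time in seconds of a group-wise sum."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        frame.groupby('Category', observed=True)['Values'].sum()
        best = min(best, time.perf_counter() - start)
    return best


t_obj = time_groupby(df_obj)
t_cat = time_groupby(df_cat)
print(f"groupby on object column:   {t_obj:.4f} s")
print(f"groupby on category column: {t_cat:.4f} s")

# Both representations should produce the same sums per label.
sums_obj = df_obj.groupby('Category', observed=True)['Values'].sum()
sums_cat = df_cat.groupby('Category', observed=True)['Values'].sum()
print(sums_cat)
```

Taking the best of several repeats reduces noise from other processes. For more careful measurements, the standard library's timeit module is a reasonable alternative.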