Optimize memory usage with Categorical data type in Pandas DataFrame

Last update on May 05 2025 13:03:53 (UTC/GMT +8 hours)

8. Optimize Memory with Categorical Data

Write a Pandas program to create a DataFrame with categorical data and use the category data type to optimize memory usage. Measure the performance difference.

Sample Solution:

Python Code:

import pandas as pd  # Import the Pandas library
import numpy as np  # Import the NumPy library

# Create a sample DataFrame with categorical data
np.random.seed(0)  # Set seed for reproducibility
data = {
    'Category': np.random.choice(['A', 'B', 'C', 'D'], size=1000000),
    'Values': np.random.randint(1, 100, size=1000000)
}
df = pd.DataFrame(data)

# Print memory usage before optimization
# df.info() prints its report directly and returns None, so it is not wrapped in print()
print("Memory usage before optimization:")
df.info(memory_usage='deep')

# Convert the 'Category' column to the category data type
df['Category'] = df['Category'].astype('category')

# Print memory usage after optimization
print("\nMemory usage after optimization:")
df.info(memory_usage='deep')

Output:

Memory usage before optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
 #   Column    Non-Null Count    Dtype 
---  ------    --------------    ----- 
 0   Category  1000000 non-null  object
 1   Values    1000000 non-null  int32 
dtypes: int32(1), object(1)
memory usage: 59.1 MB

Memory usage after optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
 #   Column    Non-Null Count    Dtype   
---  ------    --------------    -----   
 0   Category  1000000 non-null  category
 1   Values    1000000 non-null  int32   
dtypes: category(1), int32(1)
memory usage: 4.8 MB

Explanation:

Import Libraries:

Import the Pandas library for data manipulation.
Import the NumPy library for generating random data.

Create a Sample DataFrame with Categorical Data:

Set a seed for reproducibility using np.random.seed(0).
Create a dictionary named data with a 'Category' key holding one million labels drawn at random from 'A', 'B', 'C', and 'D', and a 'Values' key holding one million random integers from 1 to 99.
Generate a DataFrame df using the dictionary.

Print Memory Usage Before Optimization:

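Use df.info(memory_usage='deep') to display the memory usage of the DataFrame before optimization. The 'deep' setting makes Pandas inspect the Python string objects inside the object column, so the reported figure includes the strings themselves and not only the pointers to them.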
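To see why the 'deep' setting matters, the following sketch compares shallow and deep measurements for a single object column. It is an illustration only: the smaller column size, the variable names, and the use of np.random.default_rng are assumptions, and dtype=object is forced so that the result does not depend on the default string dtype of the installed Pandas version.

```python
import numpy as np
import pandas as pd

# Small illustrative column (100,000 rows instead of one million).
# dtype=object is forced here as an assumption, to mimic the object column above.
rng = np.random.default_rng(0)
labels = pd.Series(rng.choice(['A', 'B', 'C', 'D'], size=100_000), dtype=object)

# Shallow: counts only the array of pointers to the string objects.
shallow = labels.memory_usage(index=False, deep=False)

# Deep: also counts the memory held by the string objects themselves.
deep = labels.memory_usage(index=False, deep=True)

print(f"shallow: {shallow:,} bytes")
print(f"deep:    {deep:,} bytes")
```

The shallow figure understates the real footprint of an object column, which is why the solution passes memory_usage='deep'.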

Convert Column to Category Data Type:

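Use the astype method to convert the 'Category' column to the category data type. A categorical column stores each distinct label once and represents every row as a small integer code that points to its label, which is why a column with only four distinct values shrinks so much.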
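The internal layout of a categorical column can be inspected through the .cat accessor. The short sketch below uses a hypothetical six-row Series (not the DataFrame from the solution) to show the stored labels and the per-row integer codes.

```python
import pandas as pd

# Hypothetical six-row example, chosen only to keep the output readable.
s = pd.Series(['A', 'B', 'A', 'D', 'C', 'A']).astype('category')

print(s.cat.categories)      # the distinct labels, each stored once
print(s.cat.codes.tolist())  # one integer code per row, indexing into the labels
print(s.cat.codes.dtype)     # a small integer type, since only four labels exist
```

With so few distinct labels the codes fit in one byte per row, whereas an object column needs a pointer plus a full Python string object per row.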

Print Memory Usage After Optimization:

Use df.info(memory_usage='deep') to display the memory usage of the DataFrame after optimization. In the sample output the total drops from 59.1 MB to 4.8 MB, a reduction of roughly 92%.
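df.info() only prints a report, so it cannot be used to compute the saving directly. One possible way to get the numbers as values is DataFrame.memory_usage(deep=True), as in the sketch below. The smaller row count, the forced dtype=object, and the variable names are assumptions made for illustration; the exact figures will vary with platform and Pandas version.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # smaller than the one million rows above, to keep the sketch quick

df = pd.DataFrame({
    # dtype=object is forced as an assumption, to mimic the object column above
    'Category': pd.Series(rng.choice(['A', 'B', 'C', 'D'], size=n), dtype=object),
    'Values': rng.integers(1, 100, size=n),
})

# Total bytes across all columns plus the index, before conversion
before = df.memory_usage(deep=True).sum()

df['Category'] = df['Category'].astype('category')

# Total bytes after conversion
after = df.memory_usage(deep=True).sum()

# Dividing by 1024**2 is meant to mirror the units df.info() appears to use
print(f"before:    {before / 1024**2:.2f} MiB")
print(f"after:     {after / 1024**2:.2f} MiB")
print(f"reduction: {1 - after / before:.1%}")
```

Calling df.memory_usage(deep=True) without .sum() returns the per-column breakdown, which helps identify which columns are worth converting.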

For more Practice: Solve these Related Problems:

Write a Pandas program to convert string columns of a DataFrame into categorical data types and measure memory reduction.
Write a Pandas program to create a DataFrame with categorical columns and compare the performance of operations before and after conversion.
Write a Pandas program to optimize memory usage by converting a high-cardinality column to a category and evaluate the effect on processing speed.
Write a Pandas program to load a dataset, convert appropriate columns to 'category' dtype, and then compare memory_usage() with the original DataFrame.
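As a possible starting point for the performance-related problems above, the sketch below times a group-wise sum on an object column and on the same column converted to category. The helper time_groupby, the row count, and the choice of groupby as the benchmarked operation are all assumptions made for illustration; timings depend heavily on hardware, data, and Pandas version, so no particular speedup is implied.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200_000

# Object-typed version (dtype=object forced as an assumption)
df_obj = pd.DataFrame({
    'Category': pd.Series(rng.choice(['A', 'B', 'C', 'D'], size=n), dtype=object),
    'Values': rng.integers(1, 100, size=n),
})

# Same data with the label column converted to category
df_cat = df_obj.copy()
df_cat['Category'] = df_cat['Category'].astype('category')


def time_groupby(frame, repeats=5):
    """Hypothetical helper: best wall-clock time of a group-wise sum."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        frame.groupby('Category', observed=True)['Values'].sum()
        best = min(best, time.perf_counter() - start)
    return best


t_obj = time_groupby(df_obj)
t_cat = time_groupby(df_cat)
print(f"object column:   {t_obj * 1000:.2f} ms")
print(f"category column: {t_cat * 1000:.2f} ms")

# Both versions should produce the same sums per label
sums_obj = df_obj.groupby('Category', observed=True)['Values'].sum()
sums_cat = df_cat.groupby('Category', observed=True)['Values'].sum()
print((sums_obj.to_numpy() == sums_cat.to_numpy()).all())
```

Taking the best of several runs reduces noise from other processes. For more careful measurements, the timeit module is a common alternative.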
