w3resource

Optimize memory usage when loading a large CSV into a Pandas DataFrame


Pandas: Performance Optimization Exercise-3 with Solution


Write a Pandas program that loads a large CSV file into a DataFrame and optimizes memory usage by specifying appropriate data types.

Sample Solution:

Python Code:

import pandas as pd  # Import the Pandas library

# Define the CSV file path
csv_file_path = 'large_csv_file.csv'

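# Load a small sample of the CSV file and inspect the data types Pandas infers
chunk = pd.read_csv(csv_file_path, nrows=100)
print(chunk.dtypes)

# Specify narrower data types for the columns based on the sample.
# The keys must match the actual column names in the file.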
dtype_dict = {
    'column1': 'int32',
    'column2': 'float32',
    'column3': 'category',
    'column4': 'category',
    # Add more columns with appropriate data types
}

# Load the full CSV file with specified data types to optimize memory usage
df = pd.read_csv(csv_file_path, dtype=dtype_dict)

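# Print memory usage after optimization
# df.info() prints its report directly and returns None, so it is not wrapped in print()
print("Memory usage after optimization:")
df.info(memory_usage='deep')

Output:

Memory usage after optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2715 entries, 0 to 2714
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   1         2715 non-null   int64  
 1   1.02      2715 non-null   float64
 2   Folder    2715 non-null   object 
 3   Folder.1  2715 non-null   object 
dtypes: float64(1), int64(1), object(2)
memory usage: 629.5 KB

Note: In this sample run the columns are still int64, float64 and object. The file used here has columns named 1, 1.02, Folder and Folder.1, so the placeholder keys in dtype_dict ('column1' to 'column4') did not match any column and no narrower types were applied. Replace the keys with the actual column names in your file to see the reduction.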
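To make the effect of the dtype argument visible, the following minimal sketch loads the same data twice and compares the totals reported by memory_usage(deep=True). The column names (id, price, city) and the in-memory CSV are assumptions made for illustration only; they are not part of the exercise's file.

```python
import io
import pandas as pd

# Build a small in-memory CSV with hypothetical columns so the example is self-contained
rows = ["id,price,city"] + [
    f"{i},{i * 0.5},{'Paris' if i % 2 else 'Berlin'}" for i in range(1000)
]
csv_text = "\n".join(rows)

# Default load: Pandas typically infers int64, float64 and object
# (or a string dtype in newer Pandas versions)
df_default = pd.read_csv(io.StringIO(csv_text))

# Load again with explicit, narrower data types
dtypes = {"id": "int32", "price": "float32", "city": "category"}
df_small = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

# Compare total memory in bytes, counting the contents of string columns
default_bytes = df_default.memory_usage(deep=True).sum()
small_bytes = df_small.memory_usage(deep=True).sum()
print(f"default: {default_bytes} bytes, optimized: {small_bytes} bytes")
```

The exact byte counts depend on the Pandas version, so none are shown here. Note that float32 keeps fewer significant digits than float64, so it is only appropriate when that precision is sufficient.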

Explanation:

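  • Import Pandas Library:
    • Import the Pandas library for data manipulation.
  • Define CSV File Path:
    • Store the path to the large CSV file in csv_file_path.
  • Load Initial Chunk:
    • Load a small sample of the CSV file (e.g., 100 rows) using pd.read_csv(csv_file_path, nrows=100) and inspect the inferred data types (for example with chunk.dtypes). A small sample may not be representative of the whole file: later rows can contain missing values or larger numbers than the sample suggests.
  • Specify Data Types:
    • Based on the sample, create a dictionary dtype_dict that maps the file's actual column names to narrower data types (e.g., 'int32', 'float32', 'category'). 'category' suits text columns with few distinct values.
  • Load Full CSV with Specified Data Types:
    • Use pd.read_csv(csv_file_path, dtype=dtype_dict) to load the full CSV file with the specified data types, so the columns are stored in the narrower types from the start.
  • Print Memory Usage:
    • Use df.info(memory_usage='deep') to report the memory usage of the DataFrame after optimization. With 'deep', the contents of object columns are counted as well. df.info() prints its report itself and returns None.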
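
The step from sample to dtype dictionary can also be sketched in code. The helper below, suggest_dtypes, is hypothetical (it is not a Pandas function), and its 0.5 uniqueness threshold is an arbitrary assumption. It only proposes the types used in this exercise and, because it sees only the sample, its suggestions should be checked against the full file.

```python
import io
import pandas as pd
from pandas.api import types as ptypes


def suggest_dtypes(sample, max_unique_ratio=0.5):
    """Hypothetical helper: propose narrower dtypes from a sample DataFrame."""
    suggestions = {}
    for name, col in sample.items():
        if ptypes.is_integer_dtype(col):
            # Suggest int32 only if the sampled values fit its range
            if col.min() >= -2**31 and col.max() < 2**31:
                suggestions[name] = "int32"
        elif ptypes.is_float_dtype(col):
            # float32 halves the storage but reduces precision
            suggestions[name] = "float32"
        elif ptypes.is_object_dtype(col) or ptypes.is_string_dtype(col):
            # Few distinct values relative to the row count: category is a good fit
            if col.nunique() / max(len(col), 1) <= max_unique_ratio:
                suggestions[name] = "category"
    return suggestions


# Small in-memory CSV with hypothetical columns, standing in for a large file
csv_text = "id,price,city\n" + "\n".join(
    f"{i},{i * 0.5},{'Paris' if i % 2 else 'Berlin'}" for i in range(1000)
)

# Inspect only the first 100 rows, then load everything with the suggested types
sample = pd.read_csv(io.StringIO(csv_text), nrows=100)
dtype_dict = suggest_dtypes(sample)
print(dtype_dict)

df = pd.read_csv(io.StringIO(csv_text), dtype=dtype_dict)
df.info(memory_usage="deep")
```

Columns for which the helper has no suggestion are left out of the dictionary, so Pandas infers their types as usual.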
