Optimize memory usage when loading a large CSV into a Pandas DataFrame
Pandas: Performance Optimization Exercise-3 with Solution
Write a Pandas program that loads a large CSV file into a DataFrame and optimizes memory usage by specifying appropriate data types.
Sample Solution:
Python Code:
import pandas as pd # Import the Pandas library
# Define the CSV file path
csv_file_path = 'large_csv_file.csv'
# Load a small chunk of the CSV file to infer data types
chunk = pd.read_csv(csv_file_path, nrows=100)
# Specify the data types for the columns based on the initial chunk
dtype_dict = {
    'column1': 'int32',
    'column2': 'float32',
    'column3': 'category',
    'column4': 'category',
    # Add more columns with appropriate data types
}
# Load the full CSV file with specified data types to optimize memory usage
df = pd.read_csv(csv_file_path, dtype=dtype_dict)
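# Print memory usage after optimization
# (df.info() prints its report itself and returns None, so it is not wrapped in print())
print("Memory usage after optimization:")
df.info(memory_usage='deep')
Output:
Memory usage after optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2715 entries, 0 to 2714
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   1         2715 non-null   int64
 1   1.02      2715 non-null   float64
 2   Folder    2715 non-null   object
 3   Folder.1  2715 non-null   object
dtypes: float64(1), int64(1), object(2)
memory usage: 629.5 KB
Note: the dtypes in this sample output are still the defaults (int64, float64, object), and the column names (1, 1.02, Folder, Folder.1) do not match the placeholder keys in dtype_dict. This suggests the placeholder keys were not applied to this file. Replace 'column1' through 'column4' with the file's actual column names to see the reduced dtypes and memory usage.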
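To see what the dtype mapping actually saves, the same data can be loaded twice and the totals compared. The following is a minimal sketch, not part of the original solution: it builds a small synthetic CSV in memory as a stand-in for 'large_csv_file.csv', and the column values are made up for illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for 'large_csv_file.csv': 10,000 synthetic rows.
rng = np.random.default_rng(0)
n_rows = 10_000
sample = pd.DataFrame({
    "column1": rng.integers(0, 1_000, size=n_rows),
    "column2": rng.random(n_rows),
    "column3": rng.choice(["red", "green", "blue"], size=n_rows),
    "column4": rng.choice(["north", "south"], size=n_rows),
})
csv_text = sample.to_csv(index=False)

# Keys must match the column names in the CSV header.
dtype_dict = {
    "column1": "int32",
    "column2": "float32",
    "column3": "category",
    "column4": "category",
}

# Load once with inferred dtypes and once with the explicit mapping.
df_default = pd.read_csv(io.StringIO(csv_text))
df_typed = pd.read_csv(io.StringIO(csv_text), dtype=dtype_dict)

# memory_usage(deep=True) also counts the contents of string columns.
bytes_default = int(df_default.memory_usage(deep=True).sum())
bytes_typed = int(df_typed.memory_usage(deep=True).sum())

print(f"default dtypes:   {bytes_default:,} bytes")
print(f"specified dtypes: {bytes_typed:,} bytes")
print(df_typed.dtypes)
```

The exact byte counts depend on the pandas version and the data, so the comparison between the two numbers is the useful part. Note that 'float32' trades precision for space, which may not be acceptable for every column.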
Explanation:
- Import Pandas Library:
  - Import the Pandas library for data manipulation.
- Define CSV File Path:
  - Specify the path to the large CSV file with csv_file_path.
- Load Initial Chunk:
  - Load a small chunk of the CSV file (e.g., 100 rows) using pd.read_csv(csv_file_path, nrows=100) to inspect the column names and the data types Pandas infers.
- Specify Data Types:
  - Based on the initial chunk, create a dictionary 'dtype_dict' that maps column names to appropriate data types (e.g., 'int32', 'float32', 'category'). The keys must match the actual column names in the CSV file.
- Load Full CSV with Specified Data Types:
  - Use pd.read_csv(csv_file_path, dtype=dtype_dict) to load the full CSV file with the specified data types, so the columns are stored in the smaller types from the start.
- Print Memory Usage:
  - Use df.info(memory_usage='deep') to report the memory usage of the DataFrame after optimization. The 'deep' option also counts the contents of object (string) columns.
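The solution above writes dtype_dict by hand after looking at the sample chunk. As a rough sketch of how that step could be automated, the helper below proposes a mapping from the chunk itself. The function suggest_dtypes and its max_unique_ratio threshold are hypothetical names introduced here for illustration, not part of Pandas, and the in-memory CSV is a made-up stand-in for the real file.

```python
import io

import numpy as np
import pandas as pd


def suggest_dtypes(sample: pd.DataFrame, max_unique_ratio: float = 0.5) -> dict:
    """Propose a dtype mapping from a sample chunk (hypothetical helper)."""
    int32_info = np.iinfo("int32")
    suggestions = {}
    for name, column in sample.items():
        if pd.api.types.is_integer_dtype(column):
            # Only suggest int32 when the sampled values fit in its range.
            if int32_info.min <= column.min() and column.max() <= int32_info.max:
                suggestions[name] = "int32"
        elif pd.api.types.is_float_dtype(column):
            suggestions[name] = "float32"
        elif pd.api.types.is_string_dtype(column):
            # Few distinct values relative to the row count: good category candidate.
            if column.nunique() / max(len(column), 1) <= max_unique_ratio:
                suggestions[name] = "category"
    return suggestions


# Made-up CSV with 200 rows standing in for the large file.
csv_text = "column1,column2,column3\n" + "\n".join(
    f"{i},{i * 0.5},{'a' if i % 2 else 'b'}" for i in range(200)
)

# Step 1: inspect a small chunk; step 2: reload everything with the mapping.
chunk = pd.read_csv(io.StringIO(csv_text), nrows=100)
dtype_dict = suggest_dtypes(chunk)
df = pd.read_csv(io.StringIO(csv_text), dtype=dtype_dict)

print(dtype_dict)
print(df.dtypes)
```

A 100-row sample may not be representative of the whole file. If later rows hold missing values or values outside the int32 range in a column mapped to 'int32', the full load can fail or produce wrong values, so it is worth reviewing the suggested mapping before relying on it.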