Optimize memory usage when loading a large CSV into a Pandas DataFrame
Pandas: Performance Optimization Exercise-3 with Solution
Write a Pandas program that loads a large CSV file into a DataFrame and optimizes memory usage by specifying appropriate data types.
Sample Solution:
Python Code:
import pandas as pd # Import the Pandas library
# Define the CSV file path
csv_file_path = 'large_csv_file.csv'
# Load a small chunk of the CSV file to infer data types
chunk = pd.read_csv(csv_file_path, nrows=100)
# Specify the data types for the columns based on the initial chunk
dtype_dict = {
'column1': 'int32',
'column2': 'float32',
'column3': 'category',
'column4': 'category',
# Add more columns with appropriate data types
}
# Load the full CSV file with specified data types to optimize memory usage
df = pd.read_csv(csv_file_path, dtype=dtype_dict)
# Print memory usage after optimization
print("Memory usage after optimization:")
df.info(memory_usage='deep')  # info() prints directly and returns None, so don't wrap it in print()
Output:
Memory usage after optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2715 entries, 0 to 2714
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   1         2715 non-null   int64
 1   1.02      2715 non-null   float64
 2   Folder    2715 non-null   object
 3   Folder.1  2715 non-null   object
dtypes: float64(1), int64(1), object(2)
memory usage: 629.5 KB

(Note: the dtypes shown here are the unoptimized defaults because the sample file's column names do not match the keys in dtype_dict; with matching column names you would see int32, float32, and category instead.)
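To see the savings concretely, the sketch below generates its own small CSV (the file name and column names are illustrative, not from the exercise) and compares the memory footprint of a default load against a load with an explicit dtype mapping:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Build a small sample CSV so the example is self-contained.
rng = np.random.default_rng(0)
df_raw = pd.DataFrame({
    'column1': rng.integers(0, 100, size=10_000),
    'column2': rng.random(10_000),
    'column3': rng.choice(['A', 'B', 'C'], size=10_000),
})
csv_path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
df_raw.to_csv(csv_path, index=False)

# Default load: pandas infers int64 / float64 / object.
df_default = pd.read_csv(csv_path)

# Optimized load: smaller numeric types, 'category' for low-cardinality text.
dtype_dict = {'column1': 'int32', 'column2': 'float32', 'column3': 'category'}
df_optimized = pd.read_csv(csv_path, dtype=dtype_dict)

default_bytes = df_default.memory_usage(deep=True).sum()
optimized_bytes = df_optimized.memory_usage(deep=True).sum()
print(f"Default:   {default_bytes:,} bytes")
print(f"Optimized: {optimized_bytes:,} bytes")
```

The object column typically dominates: each Python string carries per-object overhead, whereas a category column stores each distinct value once plus small integer codes per row.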
Explanation:
- Import Pandas library: import the Pandas library for data manipulation.
- Define CSV file path: specify the path to the large CSV file with csv_file_path.
- Load initial chunk: load a small chunk of the CSV file (e.g., 100 rows) using pd.read_csv(csv_file_path, nrows=100) to infer data types.
- Specify data types: based on the initial chunk, create a dictionary dtype_dict that maps column names to appropriate data types (e.g., 'int32', 'float32', 'category').
- Load full CSV with specified data types: use pd.read_csv(csv_file_path, dtype=dtype_dict) to load the full CSV file while specifying the data types to reduce memory usage.
- Print memory usage: call df.info(memory_usage='deep') to display the DataFrame's memory usage after optimization.