How to load large datasets with Python
As a data scientist or software engineer, you are likely familiar with the Python Pandas library. Pandas is an essential tool for data analysis and manipulation, providing a fast and flexible way to work with structured data. However, you may encounter memory issues when trying to load large datasets into Pandas data frames. That is why it is so important to know how to load large datasets with Python.
Understanding the Problem
When working with large datasets, it is common to use CSV files for storing and exchanging data. CSV files are easy to use and can be opened in any text editor. However, when you try to load a large CSV file into a Pandas data frame using the "pd.read_csv()" function, you may encounter memory crashes or out-of-memory errors. This is because Pandas loads the entire CSV file into memory, which can quickly consume all available RAM.
Solution 1: Chunking
One way to avoid memory crashes when loading large CSV files is to use chunks. Chunking involves reading the CSV file in small chunks and processing each chunk separately. This approach can help reduce memory usage by loading only a small portion of the CSV file into memory at a time.
To use chunking, you can set the "chunksize" parameter in the "pd.read_csv()" function. This parameter determines the number of rows to read at a time. For example, to read a CSV file in chunks of 100,000 rows, you can use the following code:
```python
import pandas as pd

chunksize = 100000
for chunk in pd.read_csv('large_file.csv', chunksize=chunksize):
    # process each chunk here
    pass
```
In this example, the "pd.read_csv()" function returns an iterator that yields data frames of 100,000 rows each. You can then process each chunk separately within the "for" loop.
For instance, suppose you have a 3 GB file with 5,000,000 rows of sales data from a major e-commerce site. You do not need the entire dataset, only the sales where the final price is greater than 50 EUR. You cannot load the whole file as a Pandas data frame due to memory limits, so you read it little by little in chunks:
```python
import pandas as pd

# Collect the filtered chunks and concatenate them at the end.
filtered_chunks = []

# Load the large sales file in chunks of 500,000 rows:
for chunk in pd.read_csv('large_table.csv', chunksize=500000, sep=';'):
    # Apply the condition to each chunk
    chunk = chunk[chunk['final_price'] > 50]
    filtered_chunks.append(chunk)
    # Display progress as each chunk is processed
    print(chunk.shape, 'chunk loaded')

# Build the final (much smaller) data frame of filtered sales
sales = pd.concat(filtered_chunks, ignore_index=True)
```

Each iteration prints the shape of the filtered chunk, so you can watch how many rows of each chunk survive the condition as the file is processed.
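Chunking also works well when you only need an aggregate statistic rather than the filtered rows themselves: you keep running totals and discard each chunk as soon as it is processed. A minimal sketch (the file and column names here are illustrative, with a small generated file standing in for a real large one):

```python
import pandas as pd

# Illustrative data: in practice 'large_table_demo.csv' would already exist on disk.
pd.DataFrame({'final_price': range(100)}).to_csv('large_table_demo.csv', index=False)

total = 0.0
rows = 0
# Stream the file 25 rows at a time, keeping only running aggregates in memory.
for chunk in pd.read_csv('large_table_demo.csv', chunksize=25):
    total += chunk['final_price'].sum()
    rows += len(chunk)

average_price = total / rows
print(average_price)  # mean of 0..99 -> 49.5
```

Because only the running totals are retained, peak memory stays at one chunk regardless of the file size.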
Solution 2: Using Dask
Another solution to the memory issue when reading large CSV files is to use Dask. Dask is a distributed computing library that provides parallel processing capabilities for data analysis. It is built on top of popular Python libraries such as NumPy, Pandas, and scikit-learn, and provides a high-level interface for performing parallel computations. It is designed to work seamlessly with existing Python code and tools, making it a powerful tool for data professionals.
One of the main features of Dask is its ability to handle datasets that are too large to fit into memory. Dask uses lazy evaluation and smart partitioning to efficiently process data in small chunks so users can work with datasets that are larger than the available memory. This makes Dask particularly useful for working with big data and performing complex computations on distributed systems.
Another key feature of Dask is its ability to scale computations across multiple cores or even multiple machines in a cluster. Dask can automatically parallelize computations and distribute them across a cluster, taking advantage of the full computing power of the system. This makes Dask capable of processing large datasets in a fraction of the time it would take with traditional single-threaded approaches.
Dask data frames are one of the core components of the Dask library. They are designed to mimic the behavior of Pandas data frames, but with the ability to handle larger-than-memory datasets. A Dask data frame is lazily evaluated and split into smaller pieces called partitions, each of which is a regular Pandas data frame that can be processed in parallel across multiple cores or machines.
To use Dask, you need to install it using pip:
```shell
pip install "dask[complete]"
```
Once installed, you can use the "read_csv" function from the "dask.dataframe" module to load the CSV file:
```python
import dask.dataframe as dd

df = dd.read_csv('large_file.csv')
```
The "read_csv()" function returns a Dask data frame that represents the CSV file. You can then perform various operations on it, such as filtering, aggregating, and joining; these operations are not executed immediately. Instead, they are represented as a graph of computational tasks that can be optimized and scheduled for execution later. This allows Dask to efficiently handle large datasets by computing only the necessary portions of the data frame when needed. As a result, Dask can process datasets larger than the available memory by using disk storage and partitioning the data across multiple processors or machines.
Solution 3: Compression
Another way to make large CSV files more manageable is compression. Compression can significantly reduce the size of the CSV file on disk, cutting storage and I/O costs. Keep in mind that the parsed data frame still occupies its full size in RAM, so for very large files compression works best combined with chunking or loading only the columns you need.
To use compression, you first compress the CSV file with an algorithm such as gzip or bzip2. Then you can pass the "compression" parameter to the "pd.read_csv()" function to read the compressed file. For example, to read a CSV file that has been compressed with gzip, you can use the following code:
```python
import pandas as pd

df = pd.read_csv('large_file.csv.gz', compression='gzip')
```
The "read_csv()" function will read the compressed CSV file and decompress it on the fly. This reduces the disk footprint and the amount of data read from storage when loading the file into a Pandas data frame.
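Pandas can also write compressed CSVs directly, so the round trip needs no external tools. A minimal sketch (the file name and tiny data frame are illustrative):

```python
import pandas as pd

# Illustrative round trip: write a frame as gzip-compressed CSV, then read it back.
df = pd.DataFrame({'final_price': [10, 60, 120]})
df.to_csv('sales_demo.csv.gz', index=False, compression='gzip')

# Pandas decompresses on the fly while parsing.
restored = pd.read_csv('sales_demo.csv.gz', compression='gzip')
print(restored.equals(df))  # True
```

Note that when the file name ends in a recognized extension such as ".gz", Pandas infers the compression automatically, so the explicit "compression='gzip'" is optional here.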
In conclusion, working with large datasets in Python Pandas can be challenging due to memory limits. However, there are several solutions available, such as chunking, Dask, and compression. Thanks to these techniques, you can handle large datasets locally with Python and Pandas without causing memory crashes.