Dask
Dask is a great tool for managing memory usage when processing datasets. Here are a few tips on how to use Dask in Python to make the most of your compute resources.
Check your current usage and kill unused applications
# sh
## Check real-time memory usage
vm_stat                   # prints a detailed list of virtual memory statistics
top -l 1 | grep PhysMem   # prints a one-line physical memory summary
Interpretation
# PhysMem: 63G used (2173M wired, 6212M compressor), 388M unused
Out of 64G, 63G is currently in use, so in this case you need to free up memory.
- 2173M is wired memory reserved by the OS and cannot be paged out
- 6212M is data held by the memory compressor
- Only 388M is unused
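If you prefer to check this from Python rather than the shell, a minimal sketch using the psutil package works too (an assumption: psutil is not mentioned above and must be installed separately).
# python
import psutil

mem = psutil.virtual_memory()
print(f"total:     {mem.total / 1e9:.1f} GB")
print(f"used:      {mem.used / 1e9:.1f} GB")
print(f"available: {mem.available / 1e9:.1f} GB")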
Using Dask
There are a few ways to optimize data processing:
- loading data in chunks with chunks={}
- parallel processing
- processing and freeing memory regularly
- using the garbage collector (gc); see the sketch at the end of this note
chunks
The chunks argument loads data in smaller pieces, so chunk sizes can be configured differently depending on the dataset size and time steps.
Small chunks (1-20)
- Low memory use but slower due to Dask overhead
- Suited to large global datasets
Moderate chunks (30-50)
- Computing 10-30 year climatological monthly statistics with daily time steps (30 days) may work well with n=30
Large chunks (100-365)
- Computing 1-5 year climatological weekly statistics with hourly time steps (24 x 7) may work well with n=168
No chunks (-1 or {})
- Computing on a small dataset (1-2 years of daily data)
# python
import xarray as xr
# open multiple files lazily, chunked along time into blocks of n steps
ds = xr.open_mfdataset(data_files, chunks={"time": n})
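As a concrete follow-up to the moderate-chunk case above, here is a minimal sketch that opens daily data chunked along time and computes a climatological monthly mean. The file pattern and the variable name "tas" are hypothetical placeholders; only open_mfdataset, groupby, mean, and compute from xarray/Dask are assumed.
# python
import glob
import xarray as xr

# hypothetical list of daily NetCDF files; adjust the pattern to your data
data_files = sorted(glob.glob("daily_*.nc"))

# open lazily with moderate time chunks (n=30 for daily data)
ds = xr.open_mfdataset(data_files, chunks={"time": 30})

# build the computation lazily: climatological monthly mean of a variable
monthly_clim = ds["tas"].groupby("time.month").mean("time")

# nothing has been loaded yet; compute() triggers the chunked work
result = monthly_clim.compute()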
Parallel Processing
# python
from dask.distributed import Client

# memory_limit applies per worker: 4 workers x 12GB stays under a 64GB machine
client = Client(n_workers=4, threads_per_worker=2, memory_limit="12GB")
client  # in a notebook, this displays the cluster summary and dashboard link
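The list at the top also mentions processing and freeing memory regularly with the garbage collector. Below is a minimal sketch of that pattern; the yearly loop, the process_one_year function, and the file names are hypothetical placeholders, and only del and gc.collect() from the standard library are assumed.
# python
import gc
import xarray as xr

def process_one_year(ds):
    # hypothetical reduction; replace with your own computation
    return ds.mean("time")

for year in range(2000, 2010):
    ds = xr.open_mfdataset(f"daily_{year}_*.nc", chunks={"time": 30})
    result = process_one_year(ds).compute()
    result.to_netcdf(f"mean_{year}.nc")   # write the result out, then release memory
    ds.close()
    del ds, result
    gc.collect()                          # ask Python to free unreferenced objects now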