Dask
Dask is a great tool for managing memory usage when processing datasets. Here are a few tips on how to use Dask in Python to make the most of your compute resources.
Check your current usage and kill unused applications
# sh
## Check real-time memory usage
vm_stat                   # prints a detailed list of virtual memory statistics
top -l 1 | grep PhysMem   # prints a one-line physical memory summary
Interpretation
# PhysMem: 63G used (2173M wired, 6212M compressor), 388M unused
Out of 64G, 63G is currently in use, so in this case you need to free up memory.
- 2173M is wired memory reserved by the OS and cannot be paged out
- 6212M is data held by the memory compressor
- Only 388M is unused
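If you prefer to check this from Python rather than the shell, a minimal sketch using the psutil package works too (an assumption: psutil is not mentioned above and must be installed separately).
# python
import psutil

mem = psutil.virtual_memory()
print(f"total:     {mem.total / 1e9:.1f} GB")
print(f"used:      {mem.used / 1e9:.1f} GB")
print(f"available: {mem.available / 1e9:.1f} GB")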
Using Dask
There are a few ways to optimize data processing:
- loading data in chunks with chunks={}
- parallel processing
- processing and freeing memory regularly
- using the garbage collector (gc); see the sketch at the end of this note
chunks
The chunks argument loads data in smaller pieces, so chunk sizes can be configured differently depending on the dataset size and time steps.
Small chunks (1-20)
- Low memory use but slower due to Dask overhead
- Suited to large global datasets
Moderate chunks (30-50)
- Computing 10-30 year climatological monthly statistics with daily time steps (30 days) may work well with n=30
Large chunks (100-365)
- Computing 1-5 year climatological weekly statistics with hourly time steps (24 x 7) may work well with n=168
No chunks (-1 or {})
- Computing on a small dataset (1-2 years of daily data)
# python
import xarray as xr
# open multiple files lazily, chunked along time into blocks of n steps
ds = xr.open_mfdataset(data_files, chunks={"time": n})
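As a concrete follow-up to the moderate-chunk case above, here is a minimal sketch that opens daily data chunked along time and computes a climatological monthly mean. The file pattern and the variable name "tas" are hypothetical placeholders; only open_mfdataset, groupby, mean, and compute from xarray/Dask are assumed.
# python
import glob
import xarray as xr

# hypothetical list of daily NetCDF files; adjust the pattern to your data
data_files = sorted(glob.glob("daily_*.nc"))

# open lazily with moderate time chunks (n=30 for daily data)
ds = xr.open_mfdataset(data_files, chunks={"time": 30})

# build the computation lazily: climatological monthly mean of a variable
monthly_clim = ds["tas"].groupby("time.month").mean("time")

# nothing has been loaded yet; compute() triggers the chunked work
result = monthly_clim.compute()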
Parallel Processing
# python
from dask.distributed import Client

# memory_limit applies per worker: 4 workers x 12GB stays under a 64GB machine
client = Client(n_workers=4, threads_per_worker=2, memory_limit="12GB")
client  # in a notebook, this displays the cluster summary and dashboard link
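The list at the top also mentions processing and freeing memory regularly with the garbage collector. Below is a minimal sketch of that pattern; the yearly loop, the process_one_year function, and the file names are hypothetical placeholders, and only del and gc.collect() from the standard library are assumed.
# python
import gc
import xarray as xr

def process_one_year(ds):
    # hypothetical reduction; replace with your own computation
    return ds.mean("time")

for year in range(2000, 2010):
    ds = xr.open_mfdataset(f"daily_{year}_*.nc", chunks={"time": 30})
    result = process_one_year(ds).compute()
    result.to_netcdf(f"mean_{year}.nc")   # write the result out, then release memory
    ds.close()
    del ds, result
    gc.collect()                          # ask Python to free unreferenced objects now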