3D Flex generate memory error

Hi,

After a successful training run (900k particles, 256 px full-res box, 100 px for training), 3D Flex Generate fails, apparently running out of GPU memory (on a 24 GB card).

Presumably this is because of the number of particles - is there any way to run Flex Generate just on a subset of particles, using the trained model?
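
For context, what I'd ideally do is something like the rough NumPy sketch below - pull a random subset out of the particle .cs file (which, as far as I understand, is just a NumPy structured array) before feeding it downstream. The file names and subset size here are made up, and the Particle Sets Tool job may well be the cleaner way to do this inside CryoSPARC:

# Rough sketch: pull a random subset out of a CryoSPARC particle .cs file.
# Assumes the .cs file is a NumPy structured array (which is how CryoSPARC
# stores particle metadata); file names and subset size are placeholders.
import numpy as np

particles = np.load("J123_particles.cs")  # hypothetical path
print(f"{len(particles)} particles, first fields: {particles.dtype.names[:5]}")

rng = np.random.default_rng(0)
keep = rng.choice(len(particles), size=100_000, replace=False)
subset = particles[np.sort(keep)]

# np.save appends ".npy" to bare filenames, so write through a file handle
# to keep the ".cs" extension.
with open("J123_particles_subset.cs", "wb") as f:
    np.save(f, subset)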

Cheers
Oli

EDIT:
Interesting - 3D Flex Reconstruct runs to completion with no issues, but 3D Flex Generate runs out of memory (with nothing else running on the system)

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 115, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/run_generate.py", line 64, in cryosparc_master.cryosparc_compute.jobs.flex_refine.run_generate.run
  File "/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 509, in cryosparc_master.cryosparc_compute.jobs.flex_refine.flexmod.NNFlex3TM.forward
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.33 GiB. GPU 0 has a total capacty of 23.68 GiB of which 7.25 GiB is free. Including non-PyTorch memory, this process has 16.27 GiB memory in use. Of the allocated memory 15.75 GiB is allocated by PyTorch, and 27.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
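
For what it's worth, the allocator setting the error message points at (PYTORCH_CUDA_ALLOC_CONF / max_split_size_mb) has to be in the environment before PyTorch makes its first CUDA allocation - for a CryoSPARC job that would presumably mean exporting it in the worker environment (e.g. cryosparc_worker/config.sh) rather than in Python. The snippet below is just a standalone illustration of the setting, not a confirmed fix for this particular error:

# Standalone illustration of the allocator hint from the error message.
# PYTORCH_CUDA_ALLOC_CONF must be set before the first CUDA allocation,
# so in practice it is exported in the shell environment that launches the job.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch

device = torch.device("cuda:0")
x = torch.zeros(1024, 1024, device=device)  # force allocator initialisation

# Report how much memory PyTorch is actually holding on GPU 0.
print(f"allocated: {torch.cuda.memory_allocated(device) / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved(device) / 1024**3:.2f} GiB")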

Seeing the same here… also, the requested vs. available memory numbers in the error don't look like they should be a problem.

@olibclarke @yoshiokc – thanks for reporting! This behaviour is due to a recent update which changed how we handle flow in the 3D Flex generate job. We have a fix for this in the works that should be released in short order.


FYI: This should be fixed in Patch 240807.
