3D flex training error (CUDA out of memory)

frozenfas · April 24, 2023, 12:05pm

Hello, I am having trouble with the 3D Flex Training job. It fails with the error below and appears to be running out of memory (GPU=1080).

[2023-04-24 13:44:01.29]
[CPU:   4.35 GB  Avail: 249.58 GB]
====== Run 3DFlex Training =======
[2023-04-24 13:44:01.31]
[CPU:   4.35 GB  Avail: 249.58 GB]
Starting iterations..
[2023-04-24 13:45:29.87]
[CPU:   4.19 GB  Avail: 249.79 GB]
Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 96, in cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/run_train.py", line 192, in cryosparc_compute.jobs.flex_refine.run_train.run
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 1010, in cryosparc_compute.jobs.flex_refine.flexmod.run_flex_opt
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 687, in cryosparc_compute.jobs.flex_refine.flexmod.TetraSV.forward
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 663, in cryosparc_compute.jobs.flex_refine.flexmod.TetraSVFunction.forward
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.00 GiB (GPU 0; 7.93 GiB total capacity; 7.32 GiB already allocated; 434.81 MiB free; 7.33 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

It is a training job and my box size is 128 px. What confuses me is that I can run similarly configured training jobs with different datasets and they run to completion with no error. When watching nvidia-smi I never see the GPU memory allocation go above about 1200 MiB (of 8192 MiB).
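
One possible reason for a low reading is that a short allocation spike falls between manual checks of nvidia-smi. A minimal sketch of polling it at a faster rate and recording the peak is below. The helper names (`parse_memory_line`, `poll_peak_memory`) and the default interval are illustrative assumptions, not part of CryoSPARC; the only external dependency is the `nvidia-smi` command itself.

```python
import subprocess
import time


def parse_memory_line(line):
    """Parse one 'used, total' CSV line (values in MiB) from nvidia-smi."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def poll_peak_memory(gpu_index=0, interval_s=0.1, duration_s=60.0):
    """Poll nvidia-smi repeatedly and return (peak_used_mib, total_mib).

    Hypothetical helper: the interval and duration are arbitrary defaults.
    """
    cmd = [
        "nvidia-smi",
        f"--id={gpu_index}",
        "--query-gpu=memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ]
    peak, total = 0, 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        out = subprocess.run(
            cmd, capture_output=True, text=True, check=True
        ).stdout
        used, total = parse_memory_line(out.strip().splitlines()[0])
        peak = max(peak, used)
        time.sleep(interval_s)
    return peak, total


# Example (run while the training job is starting up):
#   peak, total = poll_peak_memory(gpu_index=0, duration_s=120)
#   print(f"peak {peak} MiB of {total} MiB")
```

Even at this rate a very brief spike can be missed, so the peak reported here is a lower bound on what the job actually used.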

Is there anything dataset-specific that could cause the memory requirement to be higher? Can I adjust any of the job parameters to counteract this?

frozenfas · April 25, 2023, 10:40am

My apologies, I made a mistake: I was not using exactly the same parameters with the two datasets. With the dataset that failed, I had mistakenly set a much higher value for “Base num. tetra cells” in the Mesh Prep job. When I use an identical value it seems to run fine (i.e., it has currently completed 200 iterations, whereas previously it failed before starting any). So “Base num. tetra cells” in the prior Mesh Prep job was the key parameter to change.
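
The numbers in the error message are consistent with this. PyTorch suggests `max_split_size_mb` only when reserved memory is much larger than allocated memory, and here the two are nearly equal (7.33 GiB vs 7.32 GiB), so fragmentation does not look like the cause and the job simply needed more memory than the card has. A small sketch of that arithmetic, using the message text from the traceback above (the `parse_oom` helper is hypothetical, written only for this illustration):

```python
import re

# Message text copied from the traceback in the first post.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 1.00 GiB (GPU 0; 7.93 GiB total "
    "capacity; 7.32 GiB already allocated; 434.81 MiB free; 7.33 GiB "
    "reserved in total by PyTorch)"
)

UNIT_TO_GIB = {"MiB": 1.0 / 1024.0, "GiB": 1.0}


def parse_oom(message):
    """Extract the memory figures from the OOM message, converted to GiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    stats = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        stats[name] = float(match.group(1)) * UNIT_TO_GIB[match.group(2)]
    return stats


stats = parse_oom(OOM_MESSAGE)
# Memory held by PyTorch's caching allocator but not in use by tensors.
cached_but_unused = stats["reserved"] - stats["allocated"]
# How far the request exceeds what the device still has free.
shortfall = stats["requested"] - stats["free"]
print(f"cached but unused: {cached_but_unused:.2f} GiB")
print(f"shortfall for this request: {shortfall:.2f} GiB")
```

With only about 0.01 GiB cached but unused, allocator tuning could not have recovered the roughly 0.58 GiB shortfall, which fits with reducing “Base num. tetra cells” being what fixed the job.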