Hello, I am having trouble with a 3D Flex training job. It fails with the error below and appears to be running out of GPU memory (the GPU is a GTX 1080, 8 GB).
```
[2023-04-24 13:44:01.29]
[CPU: 4.35 GB Avail: 249.58 GB]
====== Run 3DFlex Training =======
[2023-04-24 13:44:01.31]
[CPU: 4.35 GB Avail: 249.58 GB]
Starting iterations..
[2023-04-24 13:45:29.87]
[CPU: 4.19 GB Avail: 249.79 GB]
Traceback (most recent call last):
File "cryosparc_master/cryosparc_compute/run.py", line 96, in cryosparc_compute.run.main
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/run_train.py", line 192, in cryosparc_compute.jobs.flex_refine.run_train.run
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 1010, in cryosparc_compute.jobs.flex_refine.flexmod.run_flex_opt
File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 687, in cryosparc_compute.jobs.flex_refine.flexmod.TetraSV.forward
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 663, in cryosparc_compute.jobs.flex_refine.flexmod.TetraSVFunction.forward
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.00 GiB (GPU 0; 7.93 GiB total capacity; 7.32 GiB already allocated; 434.81 MiB free; 7.33 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
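The error message itself suggests setting max_split_size_mb to avoid fragmentation. I don't know whether (or where) the cryosparc_worker environment would pick up PYTORCH_CUDA_ALLOC_CONF, so the following is just my guess at what applying that suggestion would look like if it were set before the job initializes CUDA, not anything from the CryoSPARC docs:

```python
# My own sketch of the PyTorch allocator setting the error message refers to.
# I am assuming it would need to be in the environment before any CUDA
# allocation happens, so the caching allocator reads it at initialization.
import os

# Cap the allocator's split size at 128 MB to reduce fragmentation,
# as suggested by the OOM message.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported (and CUDA touched) only after the variable is set
```

Is it safe to export something like this in the worker environment, or is that likely to interfere with how CryoSPARC manages GPU memory?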
This is a training job and my box size is 128 px. What confuses me is that similarly configured training jobs on different datasets run to completion with no error. When watching nvidia-smi, I never see the GPU memory usage go above about 1200 MiB (of 8192 MiB).
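For what it's worth, I was only sampling nvidia-smi roughly once per second, so I suppose a short-lived allocation spike could slip between samples. If it helps with diagnosis, this is a rough polling script I could run alongside the job to catch the peak device memory more finely; it is just my own sketch (it assumes the pynvml / nvidia-ml-py package is available and that the job runs on GPU 0), not part of any CryoSPARC tooling:

```python
# Rough sketch: poll total GPU memory usage faster than `watch nvidia-smi`
# to catch short-lived spikes. Assumes pynvml (nvidia-ml-py) is installed;
# this is not part of CryoSPARC.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0, the device the job uses

peak = 0
try:
    while True:
        used = pynvml.nvmlDeviceGetMemoryInfo(handle).used  # bytes, whole device
        peak = max(peak, used)
        print(f"used: {used / 2**20:7.0f} MiB   peak: {peak / 2**20:7.0f} MiB", end="\r")
        time.sleep(0.05)  # ~20 samples per second
except KeyboardInterrupt:
    print(f"\npeak observed: {peak / 2**20:.0f} MiB")
finally:
    pynvml.nvmlShutdown()
```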
Is there anything dataset-specific that could cause the memory requirement to be higher? Can I adjust any of the job parameters to counteract this?