When the input cropped particles are fairly large at full size (450 px), 3D Flex Reconstruction seems to stall: it gets through all but 1000 of the first half-set, but then does not proceed any further.
There are no error messages in any of the logs as far as I can see, and the preceding 3D Flex Training job finished smoothly. Is this a known issue, and is there a workaround?
Hi @olibclarke, is the reconstruction job still running (i.e. stalled), or did it fail, or did you kill it? If it's still running, can you check the CPU and GPU RAM utilization? Did the entire system slow down, or did just the one job get stuck?
Also, how many particles are in the dataset, and how much CPU RAM is there in total in the system?
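For monitoring while a job appears stuck, something along these lines can poll host RAM and per-GPU memory/utilization once a minute. This is only a sketch: it assumes psutil is installed and that nvidia-smi is on the PATH; plain nvidia-smi in a terminal (or top/htop for the CPU side) gives the same information.

import subprocess
import time

import psutil  # assumed installed; `pip install psutil` otherwise

while True:
    # host (CPU) RAM
    ram = psutil.virtual_memory()
    print(f"host RAM used: {ram.used / 1e9:.1f} / {ram.total / 1e9:.1f} GB")
    # per-GPU memory and utilization via nvidia-smi
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total,utilization.gpu",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True)
    print(out.stdout.strip())
    time.sleep(60)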
Hi, we are also having problems with GPU memory. The workstation has 2080 Tis, and after a few hundred rounds of training we get this:
File "cryosparc_master/cryosparc_compute/run.py", line 93, in cryosparc_compute.run.main
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/run_train.py", line 192, in cryosparc_compute.jobs.flex_refine.run_train.run
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 1045, in cryosparc_compute.jobs.flex_refine.flexmod.run_flex_opt
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 1059, in cryosparc_compute.jobs.flex_refine.flexmod.run_flex_opt
File "/data/opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 687, in cryosparc_compute.jobs.flex_refine.flexmod.TetraSV.forward
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 663, in cryosparc_compute.jobs.flex_refine.flexmod.TetraSVFunction.forward
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.15 GiB (GPU 0; 10.76 GiB total capacity; 7.15 GiB already allocated; 940.94 MiB free; 9.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
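The suggestion at the end of that message refers to PyTorch's caching-allocator option max_split_size_mb, which stops the allocator from splitting blocks larger than the given size and can reduce fragmentation when reserved memory sits well above allocated memory. A minimal sketch of how it is set in plain PyTorch is below; the value 128 is only illustrative, and whether (or where) a CryoSPARC worker would pick the variable up is not something I can confirm, so treat it as a generic PyTorch illustration rather than a known fix.

import os

# must be set before the first CUDA allocation (safest: before importing torch);
# blocks larger than this size in MiB will not be split by the caching allocator
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

x = torch.zeros(1, device="cuda")  # first allocation initializes the allocator
print(torch.cuda.memory_reserved(0))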
Looking at the memory usage just prior, it is quite tight:
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:09:00.0 Off | N/A |
| 54% 86C P2 221W / 250W | 9786MiB / 11264MiB | 99% Default |
| | | N/A |
I have tried boxing down to 256 and 128 px, but it makes no difference. The total particle stack is 143k particles.
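For a rough sense of scale (not the actual 3DFlex memory model, which involves many intermediate buffers), the memory footprint of a single dense N^3 float32 volume grows with the cube of the box size:

# back-of-envelope: bytes for one dense N^3 float32 volume
for n in (128, 256, 340, 450):
    gib = (n ** 3) * 4 / 2 ** 30
    print(f"box {n:4d} px: {gib:6.2f} GiB per volume")
# box  128 px:   0.01 GiB per volume
# box  256 px:   0.06 GiB per volume
# box  340 px:   0.15 GiB per volume
# box  450 px:   0.34 GiB per volume

Even at 450 px a single volume is well under 1 GiB, so the multi-GiB allocations in the tracebacks presumably come from batches of particle images, gradients and mesh/flow buffers rather than from the map itself, which would be consistent with box size alone not rescuing the job.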
Sorry Ali, just saw this - I killed it. It didn't seem to be slowing down the system; just that one job was stuck. The job had 126,000 particles, and the system has 256 GB RAM.
File "cryosparc_master/cryosparc_compute/run.py", line 93, in cryosparc_compute.run.main
File "/home/lnd/Cryosparc/cryosparc_worker/cryosparc_compute/jobs/flex_refine/run_highres.py", line 150, in run
flexmod.do_hr_refinement_flex(numiter=params['flex_bfgs_num_iters'])
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 1640, in cryosparc_compute.jobs.flex_refine.flexmod.do_hr_refinement_flex.lambda7
File "/home/lnd/Cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/scipy/optimize/lbfgsb.py", line 198, in fmin_l_bfgs_b
**opts)
File "/home/lnd/Cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/scipy/optimize/lbfgsb.py", line 308, in _minimize_lbfgsb
finite_diff_rel_step=finite_diff_rel_step)
File "/home/lnd/Cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/scipy/optimize/optimize.py", line 262, in _prepare_scalar_function
finite_diff_rel_step, bounds, epsilon=epsilon)
File "/home/lnd/Cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/scipy/optimize/_differentiable_functions.py", line 140, in __init__
self._update_fun()
File "/home/lnd/Cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/scipy/optimize/_differentiable_functions.py", line 233, in _update_fun
self._update_fun_impl()
File "/home/lnd/Cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/scipy/optimize/_differentiable_functions.py", line 137, in update_fun
self.f = fun_wrapped(self.x)
File "/home/lnd/Cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/scipy/optimize/_differentiable_functions.py", line 134, in fun_wrapped
return fun(np.copy(x), *args)
File "/home/lnd/Cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/scipy/optimize/optimize.py", line 74, in __call__
self._compute_if_needed(x, *args)
File "/home/lnd/Cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/scipy/optimize/optimize.py", line 68, in _compute_if_needed
fg = self.fun(x, *args)
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 1640, in cryosparc_compute.jobs.flex_refine.flexmod.do_hr_refinement_flex.lambda7
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 1611, in cryosparc_compute.jobs.flex_refine.flexmod.errfunc_flex
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.57 GiB (GPU 1; 11.77 GiB total capacity; 9.33 GiB already allocated; 980.38 MiB free; 10.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Any updates on this, @apunjani? Reconstruction with a box of 340 pixels and 35,000 particles shouldn't give out-of-memory errors, especially since the setup has 128 GB per GPU node.
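For what it's worth, the capacities quoted in both tracebacks (10.76 GiB and 11.77 GiB) are the cards' own VRAM rather than the node's system RAM, and in the first trace the allocator has reserved roughly 2.4 GiB more than it has actually allocated, which is the fragmentation pattern the error message alludes to. A minimal sketch for inspecting that gap from within a Python process, assuming a reasonably recent PyTorch (mem_get_info needs roughly 1.10+):

import torch

GIB = 2 ** 30

def report(device=0):
    # free/total are device-wide; allocated/reserved are what this process's
    # PyTorch caching allocator is holding
    free, total = torch.cuda.mem_get_info(device)
    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    print(f"GPU {device}: total {total / GIB:.2f} GiB, free {free / GIB:.2f} GiB, "
          f"reserved {reserved / GIB:.2f} GiB, allocated {allocated / GIB:.2f} GiB")
    # a large (reserved - allocated) gap alongside little free memory suggests
    # fragmentation, which is what max_split_size_mb is meant to mitigate

if torch.cuda.is_available():
    report(0)

This only reports the process it runs in; to see what the CryoSPARC job itself is holding, nvidia-smi's per-process view (nvidia-smi --query-compute-apps=pid,used_memory --format=csv) is the simpler tool.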