When the input cropped particles are fairly large at full size (450 px), 3D Flex Reconstruction seems to stall: it gets through all but 1000 of the first half-set, but then does not proceed any further.
There are no error messages in any of the logs as far as I can see, and the preceding 3D Flex Training job finished smoothly. Is this a known issue, and is there a workaround?
Hi @olibclarke, is the reconstruction job still running (i.e. stalled), or did it fail, or did you kill it? If it's still running, can you check the CPU and GPU RAM utilization? Did the entire system slow down, or did just the one job get stuck?
Also, how many particles are in the dataset, and how much CPU RAM is there in total in the system?
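For monitoring while a job appears stuck, something along these lines can poll host RAM and per-GPU memory/utilization once a minute. This is only a sketch: it assumes psutil is installed and that nvidia-smi is on the PATH; plain nvidia-smi in a terminal (or top/htop for the CPU side) gives the same information.

import subprocess
import time

import psutil  # assumed installed; `pip install psutil` otherwise

while True:
    # host (CPU) RAM
    ram = psutil.virtual_memory()
    print(f"host RAM used: {ram.used / 1e9:.1f} / {ram.total / 1e9:.1f} GB")
    # per-GPU memory and utilization via nvidia-smi
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total,utilization.gpu",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True)
    print(out.stdout.strip())
    time.sleep(60)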
Hi, we are also having problems with GPU memory. The workstation has 2080 Tis, and after a few hundred rounds of training we get this:
File "cryosparc_master/cryosparc_compute/run.py", line 93, in cryosparc_compute.run.main
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/run_train.py", line 192, in cryosparc_compute.jobs.flex_refine.run_train.run
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 1045, in cryosparc_compute.jobs.flex_refine.flexmod.run_flex_opt
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 1059, in cryosparc_compute.jobs.flex_refine.flexmod.run_flex_opt
File "/data/opt/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 687, in cryosparc_compute.jobs.flex_refine.flexmod.TetraSV.forward
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 663, in cryosparc_compute.jobs.flex_refine.flexmod.TetraSVFunction.forward
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.15 GiB (GPU 0; 10.76 GiB total capacity; 7.15 GiB already allocated; 940.94 MiB free; 9.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
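The suggestion at the end of that message refers to PyTorch's caching-allocator option max_split_size_mb, which stops the allocator from splitting blocks larger than the given size and can reduce fragmentation when reserved memory sits well above allocated memory. A minimal sketch of how it is set in plain PyTorch is below; the value 128 is only illustrative, and whether (or where) a CryoSPARC worker would pick the variable up is not something I can confirm, so treat it as a generic PyTorch illustration rather than a known fix.

import os

# must be set before the first CUDA allocation (safest: before importing torch);
# blocks larger than this size in MiB will not be split by the caching allocator
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

x = torch.zeros(1, device="cuda")  # first allocation initializes the allocator
print(torch.cuda.memory_reserved(0))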
Looking at the memory usage just prior, it is quite tight:
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:09:00.0 Off | N/A |
| 54% 86C P2 221W / 250W | 9786MiB / 11264MiB | 99% Default |
| | | N/A |
I have tried boxing down to 256 and 128 px, but it makes no difference. The total particle stack is 143k particles.
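For a rough sense of scale (not the actual 3DFlex memory model, which involves many intermediate buffers), the memory footprint of a single dense N^3 float32 volume grows with the cube of the box size:

# back-of-envelope: bytes for one dense N^3 float32 volume
for n in (128, 256, 340, 450):
    gib = (n ** 3) * 4 / 2 ** 30
    print(f"box {n:4d} px: {gib:6.2f} GiB per volume")
# box  128 px:   0.01 GiB per volume
# box  256 px:   0.06 GiB per volume
# box  340 px:   0.15 GiB per volume
# box  450 px:   0.34 GiB per volume

Even at 450 px a single volume is well under 1 GiB, so the multi-GiB allocations in the tracebacks presumably come from batches of particle images, gradients and mesh/flow buffers rather than from the map itself, which would be consistent with box size alone not rescuing the job.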
Sorry Ali, just saw this - I killed it. It didn't seem to be slowing down the system; just that one job was stuck. The job had 126,000 particles, and the system has 256 GB RAM.
File "cryosparc_master/cryosparc_compute/run.py", line 93, in cryosparc_compute.run.main
File "/home/lnd/Cryosparc/cryosparc_worker/cryosparc_compute/jobs/flex_refine/run_highres.py", line 150, in run
flexmod.do_hr_refinement_flex(numiter=params['flex_bfgs_num_iters'])
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 1640, in cryosparc_compute.jobs.flex_refine.flexmod.do_hr_refinement_flex.lambda7
File "/home/lnd/Cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/scipy/optimize/lbfgsb.py", line 198, in fmin_l_bfgs_b
**opts)
File "/home/lnd/Cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/scipy/optimize/lbfgsb.py", line 308, in _minimize_lbfgsb
finite_diff_rel_step=finite_diff_rel_step)
File "/home/lnd/Cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/scipy/optimize/optimize.py", line 262, in _prepare_scalar_function
finite_diff_rel_step, bounds, epsilon=epsilon)
File "/home/lnd/Cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/scipy/optimize/_differentiable_functions.py", line 140, in __init__
self._update_fun()
File "/home/lnd/Cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/scipy/optimize/_differentiable_functions.py", line 233, in _update_fun
self._update_fun_impl()
File "/home/lnd/Cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/scipy/optimize/_differentiable_functions.py", line 137, in update_fun
self.f = fun_wrapped(self.x)
File "/home/lnd/Cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/scipy/optimize/_differentiable_functions.py", line 134, in fun_wrapped
return fun(np.copy(x), *args)
File "/home/lnd/Cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/scipy/optimize/optimize.py", line 74, in __call__
self._compute_if_needed(x, *args)
File "/home/lnd/Cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/scipy/optimize/optimize.py", line 68, in _compute_if_needed
fg = self.fun(x, *args)
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 1640, in cryosparc_compute.jobs.flex_refine.flexmod.do_hr_refinement_flex.lambda7
File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 1611, in cryosparc_compute.jobs.flex_refine.flexmod.errfunc_flex
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.57 GiB (GPU 1; 11.77 GiB total capacity; 9.33 GiB already allocated; 980.38 MiB free; 10.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Any updates on this, @apunjani? Reconstruction with a box of 340 pixels and 35,000 particles shouldn't give out-of-memory errors, especially since the setup has 128 GB per GPU node.
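For what it's worth, the capacities quoted in both tracebacks (10.76 GiB and 11.77 GiB) are the cards' own VRAM rather than the node's system RAM, and in the first trace the allocator has reserved roughly 2.4 GiB more than it has actually allocated, which is the fragmentation pattern the error message alludes to. A minimal sketch for inspecting that gap from within a Python process, assuming a reasonably recent PyTorch (mem_get_info needs roughly 1.10+):

import torch

GIB = 2 ** 30

def report(device=0):
    # free/total are device-wide; allocated/reserved are what this process's
    # PyTorch caching allocator is holding
    free, total = torch.cuda.mem_get_info(device)
    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    print(f"GPU {device}: total {total / GIB:.2f} GiB, free {free / GIB:.2f} GiB, "
          f"reserved {reserved / GIB:.2f} GiB, allocated {allocated / GIB:.2f} GiB")
    # a large (reserved - allocated) gap alongside little free memory suggests
    # fragmentation, which is what max_split_size_mb is meant to mitigate

if torch.cuda.is_available():
    report(0)

This only reports the process it runs in; to see what the CryoSPARC job itself is holding, nvidia-smi's per-process view (nvidia-smi --query-compute-apps=pid,used_memory --format=csv) is the simpler tool.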