Crash when using larger box

Leonid · November 20, 2018, 1:28pm

Hi, we use cryoSPARC 2.4.2 on cluster nodes, each with 4 x 1080 Ti GPUs, 20 CPUs and 256 GB of memory. The nodes run CUDA 9.1 and jobs are submitted via SLURM.
Most things work with a 216-pixel box, but so far every attempt at homo_refinement with a 432-pixel box has failed:
Initially, a “no heartbeat received in 30 seconds” error appeared after the 1st iteration, at the “Computing FSCs” step of the second iteration.
I tried increasing the memory allocation in cluster.sh, so that the job gets 72 GB instead of the requested 24 GB - now “LogicError: cuCtxCreate failed: invalid device ordinal” appears earlier, at the initial scale estimation stage.
Also, the “LogicError: cuCtxCreate failed: invalid device ordinal” error appears even with a 216-pixel box when trying Non-uniform refinement.
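
For context on the second error: CUDA reports “invalid device ordinal” when a process asks for a GPU index that is outside the set of devices visible to it. When CUDA_VISIBLE_DEVICES is set, the visible devices are renumbered from 0, so a job that was granted only physical GPU 2 but then requests device 2 would fail in this way. Whether that is what happens on these nodes is an assumption, not something the error messages above confirm. A minimal sketch of the check, where the function names and the `physical_gpus=4` default are hypothetical and chosen only to match the 4-GPU nodes described above:

```python
import os


def visible_gpu_ids(env=None):
    """Return the ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # unset: every physical GPU is visible
    # An empty string hides all GPUs, which yields an empty list here.
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def ordinal_is_valid(ordinal, env=None, physical_gpus=4):
    """Check whether a device ordinal falls inside the visible device range.

    physical_gpus is an assumption (4 GPUs per node) used only when
    CUDA_VISIBLE_DEVICES is unset.
    """
    ids = visible_gpu_ids(env)
    count = physical_gpus if ids is None else len(ids)
    return 0 <= ordinal < count


if __name__ == "__main__":
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    for ordinal in range(4):
        print(ordinal, ordinal_is_valid(ordinal))
```

Running this inside the SLURM job (rather than on the login node) would show which ordinals the job can actually open.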

Perhaps the internal “ram_gb” parameter can be increased?
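
As a rough guide to why memory could matter here: a cubic volume grows with the third power of the box size, so going from 216 to 432 pixels multiplies the size of each 3D array by 8. The sketch below only illustrates that scaling. It assumes single-precision (4-byte) voxels, and it says nothing about how many such arrays a refinement job holds at once or how “ram_gb” is used internally; `volume_bytes` is a hypothetical helper.

```python
def volume_bytes(box, bytes_per_voxel=4):
    """Size in bytes of one cubic volume with the given box size.

    bytes_per_voxel=4 assumes single-precision floats.
    """
    return box ** 3 * bytes_per_voxel


if __name__ == "__main__":
    small = volume_bytes(216)
    large = volume_bytes(432)
    for box, size in ((216, small), (432, large)):
        print(f"box {box}: {size / 2**20:.1f} MiB per volume")
    print("ratio:", large // small)  # doubling the box gives 2**3 = 8
```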

Any tips on what can be tried would be highly appreciated!
Many thanks for any help!

olibclarke · November 20, 2018, 6:19pm

I don’t think this can be purely a box size issue - I’m routinely using 512-px box size datasets on a workstation with dual Titan-X cards and 256 GB RAM.

Leonid · November 21, 2018, 11:29am

Hi, thanks for the info - can you share your cluster_script.sh? I’m not sure what could be different in our cluster set-up - the same dataset refines with a 216 box, but not with 432.
The nodes are shared with other jobs, mostly running RELION, but SLURM should not allow any oversubscription of CPUs or memory.
Thanks!

olibclarke · November 21, 2018, 3:15pm

Hmmm, I don’t have a cluster_script.sh, or if I do I have not edited it - this is a GPU workstation, not a cluster. Maybe that is part of the issue?

Cheers
Oli