3D refinement memory error / possible bug in submission script

structure · March 28, 2021, 3:49am

Our system admin told us there may be a bug in how refinement jobs are submitted. According to the admin, cryoSPARC does not scale its memory request accordingly and does not use the memory available. He changed the submission script to request more memory, and we can now run a few jobs, but refinement jobs with a box size of 500 pix or larger still fail.
Can someone clarify what the problem is? I could not find any other post describing it.
For example, the current job with a 560 pix box size (0.453 A/pix) always ends with the following error:
[CPU: 18.33 GB] Done in 45.905s.

[CPU: 18.33 GB] Outputting files…

[CPU: 20.50 GB] Done in 29.201s.

[CPU: 20.50 GB] Done iteration 0 in 367.868s. Total time so far 1830.583s

[CPU: 20.50 GB] ----------------------------- Start Iteration 1

[CPU: 20.50 GB] Using Max Alignment Radius 27.904 (9.094A)

[CPU: 20.50 GB] Auto batchsize: 19102 in each split

[CPU: 20.50 GB] Using dynamic mask.

[CPU: 21.33 GB] – THR 0 BATCH 500 NUM 9602 TOTAL 38.713366 ELAPSED 85.047714 –

[CPU: 21.05 GB] Processed 38204.000 images in 89.153s.

[CPU: 21.05 GB] Computing FSCs…

[CPU: 19.6 MB] ====== Job process terminated abnormally.

The submission script looks like this. The admin changed it to request 30G, but initially it asked for only 20G.
#SBATCH --job-name cryosparc_P10_J259
#SBATCH -n 4
#SBATCH --gres=gpu:1
#SBATCH -p cryoem
#SBATCH --mem=30G
#SBATCH --time=4-00:00:00
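
To compare what the log reports against what the scheduler was asked for, the two excerpts above can be read side by side. The following is a minimal sketch, not part of cryoSPARC or Slurm: `peak_cpu_gb` and `requested_mem_gb` are hypothetical helpers, the MB-to-GB factor of 1000 is an assumption, and only the `G` suffix of `--mem` is handled. The `[CPU: …]` prefix is only sampled when a message is printed, so the real peak may be higher than the value found here.

```python
import re

# Log lines copied from the post above (abridged).
LOG = """\
[CPU: 20.50 GB] Using dynamic mask.
[CPU: 21.33 GB] – THR 0 BATCH 500 NUM 9602 TOTAL 38.713366 ELAPSED 85.047714 –
[CPU: 21.05 GB] Processed 38204.000 images in 89.153s.
[CPU: 21.05 GB] Computing FSCs…
[CPU: 19.6 MB] ====== Job process terminated abnormally.
"""

# Submission header copied from the post above (abridged).
SBATCH = """\
#SBATCH -n 4
#SBATCH --gres=gpu:1
#SBATCH --mem=30G
"""

# Assumption: the log uses decimal units (1 GB = 1000 MB).
UNIT_TO_GB = {"MB": 1 / 1000, "GB": 1.0}


def peak_cpu_gb(log_text):
    """Return the largest value found in the '[CPU: x GB]' prefixes, in GB."""
    values = [
        float(num) * UNIT_TO_GB[unit]
        for num, unit in re.findall(r"\[CPU:\s*([\d.]+)\s*(MB|GB)\]", log_text)
    ]
    return max(values)


def requested_mem_gb(sbatch_text):
    """Return the value of '#SBATCH --mem=<n>G' in GB (only the 'G' suffix is handled)."""
    match = re.search(r"^#SBATCH\s+--mem=(\d+)G\s*$", sbatch_text, flags=re.MULTILINE)
    return float(match.group(1))


peak = peak_cpu_gb(LOG)
requested = requested_mem_gb(SBATCH)
print(f"peak reported: {peak:.2f} GB, requested: {requested:.0f} GB, "
      f"headroom: {requested - peak:.2f} GB")
```

With the earlier 20G request mentioned above, the same comparison would come out negative, since the reported peak already exceeds 20 GB.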

spunjani · March 28, 2021, 4:32pm

@structure, can you please provide some more details?

Which refinement job type(s)?
Any error messages or tracebacks you receive in the cryoSPARC interface?
Which GPUs are you using?
How many particles in your dataset, for the jobs that result in ‘Job process terminated abnormally’?
You mentioned box size 500px and 560px, does the same thing happen if you use a smaller box?
Thanks!

structure · March 30, 2021, 5:15pm

Hi @spunjani.

We are currently running v3.0.1.
It fails with both homogeneous and non-uniform refinement at box sizes of 500 pix and above, always with the same error. No other error messages or tracebacks were received. A small box size of 184 pix worked.
The last refinement I tried used a 560 pix box size (0.453 A/pix) with 200k particles and ended with ‘Job process terminated abnormally’.
After the admin raised the memory request from 24G to 30G, I could run a homogeneous refinement job with a 460 pix box size (200k particles), but the non-uniform refinement failed with a similar message.
Our cluster has different nodes, each with 4 RTX 6000 (24 GB), 4 RTX 2080 Ti (11 GB), or 4 NVIDIA Tesla V100s (32 GB).

Thanks!
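
The gap between the 184 pix box that worked and the 460 to 560 pix boxes that failed is easier to judge as numbers. This is a rough sketch of how the size of one cubic volume grows with box size, assuming single-precision (4-byte) voxels. `volume_gb` is a hypothetical helper, and the thread does not say how many such volumes a refinement job holds in memory at once, so the figures are per volume and not a job total.

```python
def volume_gb(box_px, bytes_per_voxel=4):
    """Size in GB (10**9 bytes) of one cubic volume with box_px voxels per side."""
    return box_px ** 3 * bytes_per_voxel / 1e9


# Box sizes mentioned in the thread: 184 px worked, 460 px and 560 px hit the failures.
for box in (184, 460, 560):
    ratio = (box / 184) ** 3  # growth relative to the box size that worked
    print(f"box {box:>3} px: {volume_gb(box):.3f} GB per float32 volume, "
          f"{ratio:.1f}x the 184 px box")
```

Because the voxel count grows with the cube of the box size, a 560 pix volume is roughly 28 times larger than a 184 pix one, which is consistent with a fixed memory request working for small boxes and failing for large ones.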