Our system admin told us there may be a bug in how refinement jobs are submitted. According to him, cryoSPARC does not scale its memory request with the job and does not use the memory that is available. He changed the submission script to request more memory, and we can now run a few jobs, but refinement jobs with a box size of around 500 pix still fail.
Can someone clarify what the problem is? I could not find any other post detailing it.
For example, the current job with a 560 pix box size (0.453 Å/pix) always ends with the following error:
[CPU: 18.33 GB] Done in 45.905s.
[CPU: 18.33 GB] Outputting files…
[CPU: 20.50 GB] Done in 29.201s.
[CPU: 20.50 GB] Done iteration 0 in 367.868s. Total time so far 1830.583s
[CPU: 20.50 GB] ----------------------------- Start Iteration 1
[CPU: 20.50 GB] Using Max Alignment Radius 27.904 (9.094A)
[CPU: 20.50 GB] Auto batchsize: 19102 in each split
[CPU: 20.50 GB] Using dynamic mask.
[CPU: 21.33 GB] – THR 0 BATCH 500 NUM 9602 TOTAL 38.713366 ELAPSED 85.047714 –
[CPU: 21.05 GB] Processed 38204.000 images in 89.153s.
[CPU: 21.05 GB] Computing FSCs…
[CPU: 19.6 MB] ====== Job process terminated abnormally.
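To compare the log against the Slurm request, the peak value reported in the `[CPU: ...]` prefixes can be pulled out with a few lines of Python. This is only a minimal sketch: the function name `peak_cpu_gb` is made up for illustration, the sample lines are copied from the log above, and I am assuming the prefix reflects the job's host memory use as reported by cryoSPARC.

```python
import re

# A few lines copied from the job log above.
LOG = """\
[CPU: 18.33 GB] Done in 45.905s.
[CPU: 20.50 GB] Done iteration 0 in 367.868s. Total time so far 1830.583s
[CPU: 21.33 GB] – THR 0 BATCH 500 NUM 9602 TOTAL 38.713366 ELAPSED 85.047714 –
[CPU: 21.05 GB] Computing FSCs…
[CPU: 19.6 MB] ====== Job process terminated abnormally.
"""

# Convert the units seen in the log prefix to GB.
UNIT_TO_GB = {"MB": 1 / 1000, "GB": 1.0}
PATTERN = re.compile(r"\[CPU:\s*([\d.]+)\s*(MB|GB)\]")


def peak_cpu_gb(log_text):
    """Return the largest [CPU: ...] value in the log, in GB (None if absent)."""
    values = [float(v) * UNIT_TO_GB[u] for v, u in PATTERN.findall(log_text)]
    return max(values) if values else None


peak = peak_cpu_gb(LOG)
print(f"peak reported CPU memory: {peak:.2f} GB")
```

For the lines above this reports a peak of 21.33 GB, which is above the original 20G request but below the 30G the script asks for now.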
The submission script looks like this. The admin changed it to request 30G; initially the script asked for only 20G.
#SBATCH --job-name cryosparc_P10_J259
#SBATCH -n 4
#SBATCH --gres=gpu:1
#SBATCH -p cryoem
#SBATCH --mem=30G
#SBATCH --time=4-00:00:00
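As a rough back-of-the-envelope check of why larger boxes need more memory, the size of a single cubic volume can be estimated as below. This is a sketch under my own assumptions (one float32 value per voxel, no padding); it is not cryoSPARC's actual memory model, and the helper name `volume_bytes` is hypothetical. A refinement presumably holds several such arrays at once, so the absolute numbers matter less than the cubic growth.

```python
def volume_bytes(box_size, bytes_per_voxel=4):
    """Bytes for one cubic volume, assuming float32 voxels and no padding."""
    return box_size ** 3 * bytes_per_voxel


for box in (500, 560):
    gb = volume_bytes(box) / 1e9
    print(f"box {box} pix: {gb:.2f} GB per volume")

# Memory per volume grows with the cube of the box size.
ratio = volume_bytes(560) / volume_bytes(500)
print(f"560 vs 500 pix: {ratio:.2f}x")
```

Under these assumptions, going from 500 to 560 pix raises the per-volume memory by about 1.4x, so a fixed memory request that just fits one box size may not fit the next.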