We have an issue when running refinements on our cluster. The job appears to need more memory than it requests in the submission script and is then killed by the cluster's resource limits.
We have now hardcoded the memory request to 32 GB and the job indeed runs further. The SLURM output confirms it consumes more memory than the specified 24 GB. However, we are now stuck with a new error:
Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 1490, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 110, in cryosparc2_compute.engine.cuda_core.GPUThread.run
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 111, in cryosparc2_compute.engine.cuda_core.GPUThread.run
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 1072, in cryosparc2_compute.engine.engine.process.work
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 392, in cryosparc2_compute.engine.engine.EngineThread.find_and_set_best_pose_shift
ValueError: invalid entry in index array
Hi @david.haselbach, can you confirm which type of refinement job this is? Is it a “legacy refinement” or a “new refinement” (i.e. with CTF refinement, in v2.12+)? Also, can you tell us the GPU model and CUDA version on the node where the “invalid entry in index array” error occurred?
It’s definitely true that some of the newer job types use more memory than they should (i.e. more than is requested from SLURM). We are working on optimizing the memory usage to fit back within the requested amounts.
It was a legacy refinement.
The index error happened on a node with 8x NVIDIA GP100GL [Tesla P100 PCIe 12GB] cards.
Our cryoSPARC worker is compiled against CUDA 9.2.88.
Is there a chance the particles going into this refinement job came from Topaz in the latest cryoSPARC versions? This issue may be related: Topaz 2D Class problem
Hi - we’re having the same issue with particles picked with the traditional template picker, running the new homogeneous refinement. Our SLURM job fails after about iteration 5 with an out-of-memory message, but there is no sign that the system actually ran out of memory. We are using 2x Tesla V100 GPUs. Is there something in the sbatch script for SLURM that we need to tweak?
@hansenbry the CPU RAM usage of the new homogeneous refinement has been substantially reduced in v2.13 (out today), so could you try that and see if it helps?
It’s probably still a good idea to increase the CPU memory request in the sbatch script, as David has done, since, depending on parameters, jobs do sometimes need more RAM than the default value… unfortunately we haven’t yet had a chance to go through all the job types and pre-compute the amount of CPU RAM that will be needed before a job runs.
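If it helps, here is a minimal sketch of that change in a SLURM cluster_script.sh template (this assumes the template still uses the standard {{ ram_gb }} / {{ num_cpu }} / {{ num_gpu }} / {{ run_cmd }} variables; the 1.5x multiplier is only an illustration, not a recommended value):

    #!/usr/bin/env bash
    #SBATCH --ntasks={{ num_cpu }}
    #SBATCH --gres=gpu:{{ num_gpu }}
    ## request 1.5x cryoSPARC's estimate instead of the default {{ ram_gb }} GB
    #SBATCH --mem={{ (ram_gb*1.5)|int }}G
    {{ run_cmd }}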
Hi, we again ran into memory issues and have now hardcoded an even higher memory request. Would it be possible to expose the memory request as an advanced option so that it can be set by the user? This would help us a lot.
I guess most users wouldn’t know the exact value, but at least they could find it out by trial and error. We really do have a number of refinements that die with the automatically set memory and only run through when we hardcode the memory in the submission script. Changing this can only be done by our administrator, which sometimes leads to quite a delay.
Has the memory calculation been reworked in the meantime?
I also regularly run into this problem when using bigger box sizes. I already have the default ram_gb multiplied by 2, but with really large box sizes (500-1000) even a factor of 16 was sometimes not enough.
I can change it myself easily and typically just add a project_uid if statement for those cases, but then small jobs like pickers and extraction also request that much RAM.
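For reference, this is roughly what that override looks like in our cluster_script.sh (just a sketch; “P42” is a placeholder for the heavy project, and the multipliers match what I described above):

    {% if project_uid == "P42" %}
    #SBATCH --mem={{ (ram_gb*16)|int }}G
    {% else %}
    #SBATCH --mem={{ (ram_gb*2)|int }}G
    {% endif %}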
We observe the same issue on some jobs (it must be parameter / input dependent):
cryoSPARC estimates mem_gb at 24.0, but the job runs out of memory. When we submit with much larger resources, we see a peak memory consumption of ca. 36 GB. The job type is “new_local_refine”.
For the “Local Refinement (New!)” jobs that run into the issue mentioned above, I am using the default settings with only particles, a map and a mask as input. The particle box size is 560 px and the particle count is 42k. This requires a total of 68 GB of RAM.
For comparison, the same job with the same particles but a box size of 480 px requires 42 GB of RAM.
The rather large particle box size is required because the refinement otherwise runs into the Nyquist limit.
The initial particles were picked with the template picker and extracted in cryoSPARC.
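As a rough back-of-the-envelope check (not an official formula), the RAM usage here seems to scale roughly with the cube of the box size: (560/480)^3 ≈ 1.59, while 68 GB / 42 GB ≈ 1.62.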
Homogeneous Refinement (New)
Settings: all default
Box size: 882 px
Number of particles: 300k
cryoSPARC RAM: 0,1,2 of 512 GB, so I guess 24 GB
SLURM MaxRSS: 164890628K (~157 GB)
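For anyone who wants to check this on their own jobs, the MaxRSS value comes from SLURM accounting, e.g. (12345678 is a placeholder job ID):

    sacct -j 12345678 --format=JobID,JobName,ReqMem,MaxRSS,Elapsed,State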
The job type-specific RAM usage estimates are for what we consider “typical” use cases.
For larger than “typical” cases, assuming the actual availability of required RAM resources:
- SLURM must be configured to allow such jobs
- a dedicated “large_mem” cluster lane should be added to your cryoSPARC instance (cryosparcm cluster connect) with a suitably multiplied #SBATCH --mem= parameter inside cluster_script.sh, like in this example (see the sketch below). Adding a lane instead of replacing the existing lane has the advantage that cryoSPARC jobs with smaller (“typical”) memory requirements won’t have to “wait” for the availability of large memory resources.
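For concreteness, a rough sketch of those steps (the lane name and the 4x multiplier are only examples; adjust them to your nodes):

    # 1. copy your existing cluster configuration to a new directory and edit it:
    #      cluster_info.json : set "name" to e.g. "slurm_large_mem"
    #      cluster_script.sh  : raise the memory request, e.g.
    #SBATCH --mem={{ (ram_gb*4)|int }}G
    # 2. from inside that directory, register the new lane:
    cryosparcm cluster connect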