Job 'NU-refinement new' fails

Dear all,

we have some data where the ‘NU-refinement new’ job fails.
Log from the UI:

[CPU: 17.60 GB] Done in 8.685s.
[CPU: 17.60 GB] Done iteration 4 in 2682.526s. Total time so far 7665.644s
[CPU: 17.60 GB] ----------------------------- Start Iteration 5
[CPU: 17.60 GB] Using Max Alignment Radius 59.559 (3.828A)
[CPU: 17.60 GB] Using Full Dataset (split 29898 in A, 29898 in B)
[CPU: 17.60 GB] Using dynamic mask.
[CPU: 17.60 GB] -- THR 1 BATCH 500 NUM 7449 TOTAL 519.89899 ELAPSED 2182.4992 --
[CPU: 18.83 GB] Processed 59796.000 images in 2183.337s.
[CPU: 18.81 GB] Computing FSCs…
[CPU: 88.2 MB] ====== Job process terminated abnormally.

cryosparcm log output:

global compute_resid_pow with (449, 1, 16, 4) 5582
block size 256 grid size (449, 1, 1)
global compute_resid_pow with (449, 1, 8, 4) 5582
block size 128 grid size (449, 8, 1)
global compute_resid_pow with (449, 1, 19, 21) 5582
block size 256 grid size (449, 2, 1)
global compute_resid_pow with (449, 1, 19, 21) 5582
block size 256 grid size (449, 2, 1)
exception in force_free_cufft_plan:
exception in cufft.Plan.__del__:
exception in cufft.Plan.__del__:
FSC No-Mask… 0.143 at 51.755 radwn. 0.5 at 43.801 radwn. Took 4.352s.
FSC Spherical Mask… ========= sending heartbeat
0.143 at 56.936 radwn. 0.5 at 44.832 radwn. Took 5.974s.
FSC Loose Mask… ========= sending heartbeat
========= sending heartbeat
0.143 at 58.276 radwn. 0.5 at 45.272 radwn. Took 12.845s.
FSC Tight Mask… ========= sending heartbeat
0.143 at 60.990 radwn. 0.5 at 52.284 radwn. Took 12.545s.
FSC Noise Sub… ========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= main process now complete.
========= monitor process now complete.

There are some ‘exception’ messages in there.

CS version 3.2.0 (not patched)

Any ideas?
Thanks!

Dear @eMKiso, it may be that the job was terminated by your cluster scheduler for insufficient resources, most likely RAM.

Hi @spunjani,

the nodes have 128 GB of RAM.
Is it possible that it is too little?

cryoSPARC does not seem to request much RAM:

CPU : [0, 1, 2, 3]
GPU : [0]
RAM : [0, 1, 2]
SSD : False
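
For reference, the RAM field above lists allocated RAM ‘slots’. Assuming cryoSPARC’s usual 8 GB per slot (that slot size is my assumption, not something from the job output), three slots would correspond to roughly 24 GB:

ram_slots = [0, 1, 2]        # from the resource allocation printed above
GB_PER_SLOT = 8              # assumed slot size, not taken from the cryoSPARC source
print(f"~{len(ram_slots) * GB_PER_SLOT} GB of RAM requested")   # ~24 GB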

We use SLURM.

Any way to test if RAM is the issue?

Best!

Oh sorry, it really does seem to be an issue with RAM.
I just found this in the SLURM output:

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=27898624.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

OK, I suppose this really is a RAM issue… Can someone just confirm this based on the above error?
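
One way to double-check from the SLURM side is to query the accounting database for the step’s final state and peak memory use. A rough sketch (the field names are standard sacct columns; the job ID is the one from the slurmstepd message above):

import subprocess

# Rough sketch: ask SLURM accounting how the step ended and how much memory it used.
jobid = "27898624"
result = subprocess.run(
    ["sacct", "-j", jobid, "-P", "--format=JobID,State,ReqMem,MaxRSS,Elapsed"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
# An OOM-killed step typically shows State=OUT_OF_MEMORY (or FAILED) with MaxRSS
# close to ReqMem.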

Best!

Dear @spunjani,

but now that I look into the details more closely, I see that cryoSPARC reserved ‘only’ about 25 GB of RAM through the SLURM sbatch system:

#SBATCH --mem=24576

So it seems that the job was terminated because SLURM detected that the CS job exceeded its requested RAM. Can we fix this somehow?

Thank you!

Hi @eMKiso,

If you go into your installation directory, into cryosparc_worker/cryosparc_compute/jobs/nonuniform_refine, and open “build.py” in a text editor, near the very bottom you’ll find a line that reads:

job.set_resources_needed(4, 1, 24000, params.get('compute_use_ssd', True))

The third argument (24000) is the amount of RAM the job will request, in MB. If you change that number, the amount of RAM requested from SLURM should change accordingly. You could try increasing it to 65000 or so. You may have to run cryosparcm cli "reload()" before the changes take effect.
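
To sanity-check whether a change took effect, here is a rough sketch of the conversion I believe happens between that number and the #SBATCH --mem value; the factors are inferred from 24000 producing --mem=24576 in your script, not taken from the cryoSPARC source:

# Rough sketch of the presumed ram_mb -> "#SBATCH --mem" conversion (assumption:
# ram_gb = ram_mb / 1000 is handed to a template like --mem={{ (ram_gb*1024)|int }}).
def expected_sbatch_mem(ram_mb: int) -> int:
    ram_gb = ram_mb / 1000
    return int(ram_gb * 1024)

print(expected_sbatch_mem(24000))   # 24576 -> the value you are seeing now
print(expected_sbatch_mem(65000))   # 66560 -> what to look for after the edit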

Hope this helps!
–Harris

Hi @hsnyder,

I tried this: I changed it to 65000 and ran another NU refine, but the requested RAM was again 24576. I did run cryosparcm cli "reload()".
SLURM reported the out-of-memory error.

I didn’t restart the CS with the cryosparcm restart. Is that also necessary?

Best!

Hi @eMKiso,

Oh, sorry about that! Yes, try restarting cryosparc. If it still doesn’t work, let me know (I would consider that a bug).

–Harris

Hi @hsnyder,

I restarted the CS and at the same time applied the latest patch, but I still see #SBATCH --mem=24576 in the SLURM script.

Here is a copy of the last line in the modified file …/cryosparc_worker/cryosparc_compute/jobs/nonuniform_refine/build.py

job.set_resources_needed(4, 1, 65000, params.get('compute_use_ssd', True))

We would be really happy if you are able to figure this out. This is happening on a cluster we recently started using. We process similar data in-house on an older workstation and have never had this issue.

Best regards!

Hi @eMKiso,

My colleague investigated this today and I think the confusion was on my end. Very sorry about that. The change you made to the build.py file looks perfect, keep that. Now try:

cryosparcm cli "refresh_job_types()"

Then, clone the job (rather than just clearing and re-running it). The new job should run with the updated memory amount.

Let me know how that goes, and sorry again for the confusion.

Harris

Hi @hsnyder,
no problem. I just tried the command to refresh the job types.
After the refresh I still see the same amount of RAM requested: #SBATCH --mem=24576.

I tried cloning a job, and I also built a new NU Refinement (NEW!) job.
I restarted cryosparc with cryosparcm restart.

I also checked the cluster_script.sh in the folder .../cryosparc/cryosparc_master. There I see #SBATCH --mem={{ (ram_gb*1024)|int }}. That matches the template on the cryoSPARC site, so it is probably OK.
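
To rule the template itself out, one can render that line in isolation (a quick sketch, assuming only the quoted line and the jinja2 package); for ram_gb=24 it gives exactly the --mem=24576 we are seeing, which suggests the problem is the ram_gb value cryoSPARC passes in, not the template:

# Quick sketch: render the quoted #SBATCH line on its own to see what --mem
# a given ram_gb produces. Requires the jinja2 package.
from jinja2 import Template

line = "#SBATCH --mem={{ (ram_gb*1024)|int }}"
for ram_gb in (24, 65):
    print(Template(line).render(ram_gb=ram_gb))
# #SBATCH --mem=24576   <- matches what the cluster script currently produces
# #SBATCH --mem=66560   <- what it should produce if 65 GB were being requested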

Is it possible that this build.py file is actually meant to build a ‘package’ that is then used to run a command? Is the build.py file just a set of instructions used during compilation? Sorry if I am talking nonsense, just thinking ‘out loud’.

Best!

Hi @eMKiso,

I think I’ve figured out what’s wrong, and it’s my fault again: I was having you edit the wrong builder file. The actual job builder for “NU-refine new” is located in jobs/refine/newbuild.py. The line to edit should be line 176 if you’re on the latest patch. That line will read job.set_resources_needed(4, 1, 24000, params.get('compute_use_ssd', True)), and the 24000 needs to be changed, as before. Once that’s done, refresh the job types and clone the job again.
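
If it is ever unclear which builder file governs a given job type, a quick way is to list every resource request in the worker code, along the lines of the sketch below (the install path is just an example; adjust it to your cryosparc_worker location):

# Rough sketch: list every set_resources_needed(...) call under the worker's job
# builders to see which file governs which job type. The path is an example only.
from pathlib import Path

jobs_dir = Path("/path/to/cryosparc_worker/cryosparc_compute/jobs")
for py in sorted(jobs_dir.rglob("*.py")):
    for lineno, text in enumerate(py.read_text(errors="ignore").splitlines(), 1):
        if "set_resources_needed" in text:
            print(f"{py.relative_to(jobs_dir)}:{lineno}: {text.strip()}")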

Let me know if you have any additional issues, and sorry for the confusion.

–Harris

Hi @hsnyder,

sorry to report, no success.
cryoSPARC always requests the same amount of RAM in the SLURM script. Also, in my case the line in newbuild.py was not 176 but around 180 (sorry, I forget the exact number now).

For now I have worked around the issue by manually setting the requested RAM to 60000 in the SLURM script for all cryoSPARC jobs… Not great, but at least we will be able to test whether RAM is really the issue and continue with data processing.
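
A related option, not exactly what I did (I just hard-set the value), would be to put a floor into cluster_script.sh with Jinja’s max filter, so every job gets at least ~60 GB while larger requests still pass through. A quick sketch to test the line in isolation:

# Rough sketch (not the exact workaround described above): enforce a 60000 MiB
# floor in the cluster template using Jinja's max filter, tested in isolation.
from jinja2 import Template

line = "#SBATCH --mem={{ [(ram_gb*1024)|int, 60000] | max }}"
for ram_gb in (24, 65):
    print(Template(line).render(ram_gb=ram_gb))
# #SBATCH --mem=60000   (floor applies to the default 24 GB request)
# #SBATCH --mem=66560   (larger requests are unchanged)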

I don’t know what is wrong, but no matter what I changed in the .py files, the amount of requested RAM in the SLURM script never changed.

I am happy to troubleshoot further. But it seems to be specific to this cluster or something. Otherwise there would be more reports of failed jobs…

Best!