Patch Motion is frequently broken

LilyLily · September 18, 2024, 3:36pm

Hi everyone,

I am running the Patch Motion Correction job (v4.5.3). It frequently and suddenly gets interrupted, and the web pages fail to load. A lock file is generated in the tmp folder. Yesterday, it was interrupted while processing 6,000 images. Today, it was interrupted while separately processing 120, 40, and 20 images. The error messages are pasted below:
“**** Kill signal sent by CryoSPARC (ID: ) ****
Job is unresponsive - no heartbeat received in 180 seconds.”

What could be causing this? Additionally, how can I completely remove everything related to CryoSPARC from our workstation to reinstall it?

Thanks a lot!

wtempel · September 18, 2024, 7:33pm

What is the name of that lock file?

Please can you post the outputs of these commands (on the CryoSPARC master):

projectid=P99 # replace with actual project ID
jobid=J199 # replace with actual job ID
free -h
sudo journalctl | grep -i oom
cryosparcm cli "get_scheduler_targets()"
cryosparcm eventlog $projectid $jobid | head -n 40
cryosparcm joblog $projectid $jobid | tail -n 20

LilyLily · September 18, 2024, 8:27pm

Thank you so much for your reply.

The lock file's name is:

cryosparc-supervisor-86986eb42f2c678a4816f644b3866e26.sock

jobid=J74
~/cryosparc_user/cryosparc/cryosparc_master$ free -h
               total        used        free      shared  buff/cache   available
Mem:           125Gi       7.8Gi       6.5Gi       473Mi       110Gi       115Gi
Swap:            9Gi        13Mi         9Gi

Sep 18 13:26:30 systemd[1]: Starting Userspace Out-Of-Memory (OOM) Killer…
Sep 18 13:26:30 systemd[1]: Started Userspace Out-Of-Memory (OOM) Killer.
Sep 18 11:08:16 systemd-oomd[1619]: Killed /user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-xxx.scope due to memory pressure for /user.slice/user-1000.slice/user@1000.service being 91.93% > 50.00% for > 20s with reclaim activity
Sep 18 11:08:16 systemd[2459]: vte-spawn-xxx.scope: systemd-oomd killed 36 process(es) in this unit.

It seems that this is a memory issue. How can I fix it without upgrading the workstation or reducing the number of GPUs used?

Many thanks!

LilyLily · September 19, 2024, 1:23pm

Update: If I use only one GPU (4090), it works, but if I use two GPUs, it breaks immediately!

Thanks!

wtempel · September 19, 2024, 1:51pm

Do the job failures occur around the time indicated in the output of the command

sudo journalctl | grep -i oom

?
If they do, you may want to look at Refused connection when cryosparc is running - #25 by wtempel.
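
To line up those times with job failures, the systemd-oomd kill events can be pulled out of the journal. The following is a minimal shell sketch, not part of CryoSPARC: `oomd_kills` is a hypothetical helper name, and it assumes the kill lines look like the ones pasted earlier in this thread.

```shell
# Hypothetical helper (not a CryoSPARC or systemd command).
# Reads journal text on stdin and prints "<timestamp> <memory pressure>"
# for every systemd-oomd kill line. Assumes the line format shown in this
# thread: "... systemd-oomd[PID]: Killed ... being NN.NN% > ...".
oomd_kills() {
  grep 'systemd-oomd\[[0-9]*]: Killed' |
    sed -E 's/^([A-Z][a-z]{2} +[0-9]+ [0-9:]{8}).* being ([0-9.]+%) > .*/\1 \2/'
}

# Usage on a live system (needs permission to read the journal):
#   sudo journalctl --no-pager | oomd_kills
```

The printed timestamps can then be compared by hand with the times at which the jobs failed. The journal usually prints local time, so a time zone offset may need to be taken into account when comparing against other logs.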

LilyLily · September 19, 2024, 2:21pm

Thanks, I have pasted partial information about that!
Sep 18 13:26:30 systemd[1]: Starting Userspace Out-Of-Memory (OOM) Killer…
Sep 18 13:26:30 systemd[1]: Started Userspace Out-Of-Memory (OOM) Killer.
Sep 18 11:08:16 systemd-oomd[1619]: Killed /user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-xxx.scope due to memory pressure for /user.slice/user-1000.slice/user@1000.service being 91.93% > 50.00% for > 20s with reclaim activity
Sep 18 11:08:16 systemd[2459]: vte-spawn-xxx.scope: systemd-oomd killed 36 process(es) in this unit.

Thanks!

wtempel · September 19, 2024, 3:30pm

Were any CryoSPARC jobs running (and failing) around that time?

LilyLily · September 19, 2024, 3:38pm

Yes, several similar error messages occurred. I only pasted the most recent one.

wtempel · September 19, 2024, 4:07pm

Unfortunately, I do not have a suggestion that meets all requirements in

LilyLily · September 19, 2024, 4:27pm

Could you please give me other suggestions that don’t have to meet the requirements?

Many thanks!

Andre · September 24, 2024, 5:01am

I’m having a similar issue with Patch Motion Correction, with error message “Job is unresponsive - no heartbeat received in 180 seconds.”

Here is the result of

cryosparcm cli "get_scheduler_targets()"
cryosparcm eventlog $projectid $jobid | head -n 40
cryosparcm joblog $projectid $jobid | tail -n 20

[{'cache_path': '/mnt/Scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 2, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 3, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}], 'hostname': '10.250.102.2', 'lane': 'Amphitrite', 'monitor_port': None, 'name': '10.250.102.2', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparcuser@10.250.102.2', 'title': 'Worker node 10.250.102.2', 'type': 'node', 'worker_bin_path': '/home/cryosparcuser/cryosparc/cryosparc_worker/bin/cryosparcw'}]
[Tue, 24 Sep 2024 00:44:30 GMT]  License is valid.
[Tue, 24 Sep 2024 00:44:30 GMT]  Launching job on lane Amphitrite target 10.250.102.2 ...
[Tue, 24 Sep 2024 00:44:30 GMT]  Running job on remote worker node hostname 10.250.102.2
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Job J203 Started
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Master running v4.5.3, worker running v4.5.3
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Working in directory: /home/cryosparcuser/Chiron/Shelly_wickham/Jiahe/Cryosparc/Cryosparc_projects/CS-2x2cube/J203
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Running on lane Amphitrite
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Resources allocated:
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB]   Worker:  10.250.102.2
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB]   CPU   :  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB]   GPU   :  [0, 1, 2, 3]
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB]   RAM   :  [0, 1, 2, 3, 4, 5, 6, 7]
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB]   SSD   :  False
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] --------------------------------------------------------------
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Importing job module for job type patch_motion_correction_multi...
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] Job ready to run
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] ***************************************************************
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] Job will process this many movies:  971
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] Job will output denoiser training data for this many movies:  200
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] Random seed: 690708223
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] parent process is 57387
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 163 MB] Calling CUDA init from 57453
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 163 MB] Calling CUDA init from 57456
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 163 MB] Calling CUDA init from 57454
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 163 MB] Calling CUDA init from 57455
[Tue, 24 Sep 2024 00:44:39 GMT] [CPU RAM used: 294 MB] -- 2.0: processing 4 of 971: J195/imported/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions.mrc
        loading /home/cryosparcuser/Chiron/Shelly_wickham/Jiahe/Cryosparc/Cryosparc_projects/CS-2x2cube/J195/imported/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions.mrc
        Loading raw movie data from J195/imported/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions.mrc ...
        Done in 15.53s
        Processing ...
        Done in 17.77s
        Completed rigid and patch motion with (Z:5,Y:8,X:8) knots
        Writing non-dose-weighted result to J203/motioncorrected/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions_patch_aligned.mrc ...
        Done in 0.18s
        Writing 120x120 micrograph thumbnail to J203/thumbnails/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions_thumb_@1x.png ...
        Done in 0.01s
        Writing 240x240 micrograph thumbnail to J203/thumbnails/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions_thumb_@2x.png ...
        Done in 0.01s
        Writing dose-weighted result to J203/motioncorrected/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions_patch_aligned_doseweighted.mrc ...
        Done in 0.15s
Traceback (most recent call last):
  File "<string>", line 9, in <module>
BrokenPipeError: [Errno 32] Broken pipe
========= sending heartbeat at 2024-09-24 11:55:54.559373
========= sending heartbeat at 2024-09-24 11:56:04.579781
========= sending heartbeat at 2024-09-24 11:56:14.600115
========= sending heartbeat at 2024-09-24 11:56:24.620202
========= sending heartbeat at 2024-09-24 11:56:34.640542
========= sending heartbeat at 2024-09-24 11:56:44.662026
========= sending heartbeat at 2024-09-24 11:56:54.682136
========= sending heartbeat at 2024-09-24 11:57:04.701806
========= sending heartbeat at 2024-09-24 11:57:14.720815
========= sending heartbeat at 2024-09-24 11:57:24.741521
========= sending heartbeat at 2024-09-24 11:57:34.761958
========= sending heartbeat at 2024-09-24 11:57:44.782397
========= sending heartbeat at 2024-09-24 11:57:54.803461
========= sending heartbeat at 2024-09-24 11:58:04.824165
========= sending heartbeat at 2024-09-24 11:58:14.844469
========= sending heartbeat at 2024-09-24 11:58:24.864399
========= sending heartbeat at 2024-09-24 11:58:34.885077
========= sending heartbeat at 2024-09-24 11:58:44.906332
========= sending heartbeat at 2024-09-24 11:58:54.926720
/home/cryosparcuser/cryosparc/cryosparc_worker/bin/cryosparcw: line 150: 57387 Terminated              python -c "import cryosparc_compute.run as run; run.run()" "$@"

There were no “Out of memory” errors when the job failed (according to sudo journalctl | grep -i oom).
When we try running the job again, sometimes it fails (at a different point) and sometimes it finishes.

Any idea on how we could troubleshoot what’s causing the heartbeat to be lost?
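
One thing that might help narrow this down is checking whether the worker itself ever stopped sending heartbeats. Below is a rough shell sketch, assuming the "sending heartbeat at" line format from the job log above. `heartbeat_gaps` is a hypothetical helper name, and the 30-second default threshold is an arbitrary choice, picked only because the heartbeats above arrive about every 10 seconds.

```shell
# Hypothetical helper (not a CryoSPARC command).
# Reads a job log on stdin and reports consecutive heartbeat lines that are
# more than LIMIT seconds apart (default 30). Assumes lines of the form
# "========= sending heartbeat at YYYY-MM-DD HH:MM:SS.ffffff" and gaps
# shorter than 24 hours.
heartbeat_gaps() {
  awk -v limit="${1:-30}" '
    /sending heartbeat at/ {
      split($(NF), t, ":")                 # last field is HH:MM:SS.ffffff
      now = t[1] * 3600 + t[2] * 60 + t[3] # seconds since midnight
      if (seen) {
        gap = now - prev
        if (gap < 0) gap += 86400          # crossed midnight
        if (gap > limit) printf "%s -> %s: %.1f s\n", prev_stamp, $(NF), gap
      }
      prev = now; prev_stamp = $(NF); seen = 1
    }'
}

# Usage (project and job IDs as in the commands above):
#   cryosparcm joblog $projectid $jobid | heartbeat_gaps 30
```

No output would mean the worker-side log shows no gap longer than the threshold between the heartbeats it recorded. That alone does not say whether the master received them.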

wtempel · October 1, 2024, 4:58pm

@Andre Thanks for posting this information.
Please can you post the output of the following commands?

On the CryoSPARC master (replacing P99 with the actual project ID):

cryosparcm cli "get_job('P99', 'J203', 'job_type', 'version', 'instance_information', 'status', 'killed_at', 'started_at', 'failed_at', 'params_spec', 'cloned_from', 'PID_main', 'PID_monitor')"

On the CryoSPARC worker computer where job J203 failed:

hostname -f
cat /sys/kernel/mm/transparent_hugepage/enabled 
cat /proc/cmdline

Andre · October 11, 2024, 2:20am

@wtempel Thanks for following up on this. Here is the result for the CryoSPARC master

[cryosparcuser@amphitrite ~]$ cryosparcm cli "get_job('P105', 'J203', 'job_type', 'version', 'instance_information', 'status', 'killed_at', 'started_at', 'failed_at', 'params_spec', 'cloned_from', 'PID_main', 'PID_monitor')"
{'PID_main': 57387, 'PID_monitor': 57393, '_id': '66f20b502b5694c187be0742', 'cloned_from': None, 'failed_at': 'Tue, 24 Sep 2024 02:02:01 GMT', 'instance_information': {'CUDA_version': '11.8', 'available_memory': '243.45GB', 'cpu_model': 'Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz', 'driver_version': '12.4', 'gpu_info': [{'id': 0, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti', 'pcie': '0000:18:00'}, {'id': 1, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti', 'pcie': '0000:3b:00'}, {'id': 2, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti', 'pcie': '0000:86:00'}, {'id': 3, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti', 'pcie': '0000:af:00'}], 'ofd_hard_limit': 4096, 'ofd_soft_limit': 1024, 'physical_cores': 40, 'platform_architecture': 'x86_64', 'platform_node': 'amphitrite.research.sydney.edu.au', 'platform_release': '3.10.0-1160.53.1.el7.x86_64', 'platform_version': '#1 SMP Fri Jan 14 13:59:45 UTC 2022', 'total_memory': '251.35GB', 'used_memory': '6.94GB'}, 'job_type': 'patch_motion_correction_multi', 'killed_at': 'Tue, 24 Sep 2024 02:02:00 GMT', 'params_spec': {'compute_num_gpus': {'value': 4}}, 'project_uid': 'P105', 'started_at': 'Tue, 24 Sep 2024 00:44:33 GMT', 'status': 'failed', 'uid': 'J203', 'version': 'v4.5.3'}

And here is the result for the worker

[cryosparcuser@amphitrite ~]$ hostname -f
amphitrite.research.sydney.edu.au
[cryosparcuser@amphitrite ~]$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
[cryosparcuser@amphitrite ~]$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-1160.53.1.el7.x86_64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb modprobe.blacklist=nouveau LANG=en_AU.UTF-8 nouveau.modeset=0 rd.driver.blacklist=nouveau

wtempel · October 11, 2024, 6:19pm

Please can you test if running the commands (details)

sudo sh -c "echo never > /sys/kernel/mm/transparent_hugepage/enabled"
sudo sh -c "echo never > /sys/kernel/mm/transparent_hugepage/defrag"

before running the job resolves the issue?
These commands would change the transparent_hugepage settings temporarily and should lose their effect during a system reboot.

Andre · October 14, 2024, 5:03am

@wtempel Do I run this on the worker or the master?

Also note that this issue is inconsistent. When I rerun this job, it could be successful even without this, which means that a successful run doesn’t mean this fixed the issue.

wtempel · October 18, 2024, 6:44pm

For the test, transparent_hugepage would need to be disabled on the worker.

Please let us know when you

1. encounter the issue again and
2. have confirmed that the command (on the applicable worker)
```
cat /sys/kernel/mm/transparent_hugepage/enabled
```
shows
```
always madvise [never]
```
when the issue occurs, that is, transparent_hugepage has in fact been disabled on the relevant worker and not reenabled (automatically when the system is running or during system startup).
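
For recording that check alongside each run, a small sketch like the following could be used. `thp_mode` is a hypothetical helper name. It only extracts the bracketed (active) value from the file contents shown above and changes nothing on the system.

```shell
# Hypothetical helper (not a CryoSPARC or kernel tool).
# Reads a transparent_hugepage setting line on stdin, for example
# "always madvise [never]", and prints only the active (bracketed) mode.
thp_mode() {
  sed -E 's/.*\[([a-z]+)\].*/\1/'
}

# Usage on the worker, before starting the job and again after a failure:
#   thp_mode < /sys/kernel/mm/transparent_hugepage/enabled
#   thp_mode < /sys/kernel/mm/transparent_hugepage/defrag
```

If either call prints something other than `never` at the time of a failure, the setting was presumably reset, and that run would not count as a test of the change.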