Patch Motion Correction frequently fails (no heartbeat)

Hi everyone,

I am running the Patch Motion Correction job (CryoSPARC v4.5.3); it frequently and abruptly fails, and when it does, the web interface also fails to load. A lock file is generated in the /tmp folder. Yesterday, it failed while processing 6,000 images. Today, it failed while separately processing 120, 40, and 20 images. The error messages are pasted below:
“**** Kill signal sent by CryoSPARC (ID: ) ****
Job is unresponsive - no heartbeat received in 180 seconds.”

What could be causing this? Additionally, how can I completely remove everything related to CryoSPARC from our workstation to reinstall it?

Thanks a lot!

What is the name of that lock file?

Please can you post the outputs of these commands (on the CryoSPARC master):

projectid=P99 # replace with actual project ID
jobid=J199 # replace with actual job ID
free -h
sudo journalctl | grep -i oom
cryosparcm cli "get_scheduler_targets()"
cryosparcm eventlog $projectid $jobid | head -n 40
cryosparcm joblog $projectid $jobid | tail -n 20
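
As for completely removing CryoSPARC before a reinstall: a minimal sketch of the usual steps, assuming a default single-workstation layout (replace <install_path> with your actual installation directory; this is not an official uninstaller, and it deletes the database and with it all project metadata, so back up anything you need first):

cryosparcm stop   # stop all CryoSPARC processes first

# Remove the master, worker and database directories
# (the database location depends on the --dbpath chosen at install time)
rm -rf <install_path>/cryosparc_master <install_path>/cryosparc_worker <install_path>/cryosparc_database

# Remove the stale supervisor lock file, if present
rm -f /tmp/cryosparc-supervisor-*.sock

# Remove any PATH export for cryosparc_master/bin added to ~/.bashrc during installation
sed -i '/cryosparc_master\/bin/d' ~/.bashrc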

Thank you very much for your reply.

  1. The lock file’s name is:

cryosparc-supervisor-86986eb42f2c678a4816f644b3866e26.sock

  2. jobid=J74
    ~/cryosparc_user/cryosparc/cryosparc_master$ free -h
                   total        used        free      shared  buff/cache   available
    Mem:           125Gi       7.8Gi       6.5Gi       473Mi       110Gi       115Gi
    Swap:            9Gi        13Mi         9Gi

Sep 18 13:26:30 systemd[1]: Starting Userspace Out-Of-Memory (OOM) Killer…
Sep 18 13:26:30 systemd[1]: Started Userspace Out-Of-Memory (OOM) Killer.
Sep 18 11:08:16 systemd-oomd[1619]: Killed /user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-xxx.scope due to memory pressure for /user.slice/user-1000.slice/user@1000.service being 91.93% > 50.00% for > 20s with reclaim activity
Sep 18 11:08:16 systemd[2459]: vte-spawn-xxx.scope: systemd-oomd killed 36 process(es) in this unit.

It seems that this is a memory issue. How can I fix it without upgrading the workstation or reducing the number of GPUs I use?

Many thanks!

Update: if I use only one GPU (an RTX 4090), the job completes, but if I use two GPUs, it fails immediately!

Thanks!

Do the job failures occur around the time indicated in the output of the command

sudo journalctl | grep -i oom

?
If they do, you may want to look at Refused connection when cryosparc is running - #25 by wtempel.
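
If the OOM kills do line up with the job failures, here is a sketch of how you might confirm and, as a test, work around systemd-oomd on the worker (assuming the worker runs systemd-oomd; disabling the userspace OOM killer trades memory-pressure protection for stability, so treat it as a diagnostic step only):

# Show systemd-oomd activity with timestamps, to compare against job failure times
sudo journalctl -u systemd-oomd --since "2024-09-18"

# Inspect the memory-pressure limits systemd-oomd is enforcing
oomctl

# As a test only: stop and mask the userspace OOM killer
# (masking prevents it from being re-activated via its socket until unmasked)
sudo systemctl mask --now systemd-oomd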

Thanks, I have pasted the relevant part of that output below:
Sep 18 13:26:30 systemd[1]: Starting Userspace Out-Of-Memory (OOM) Killer…
Sep 18 13:26:30 systemd[1]: Started Userspace Out-Of-Memory (OOM) Killer.
Sep 18 11:08:16 systemd-oomd[1619]: Killed /user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-xxx.scope due to memory pressure for /user.slice/user-1000.slice/user@1000.service being 91.93% > 50.00% for > 20s with reclaim activity
Sep 18 11:08:16 systemd[2459]: vte-spawn-xxx.scope: systemd-oomd killed 36 process(es) in this unit.

Thanks!

Were any CryoSPARC jobs running (and failing) around that time?

Yes, several similar error messages occurred. I only pasted the most recent one.

Unfortunately, I do not have a suggestion that meets all of the requirements in your question (no workstation upgrade and no reduction in the number of GPUs used).

Could you please give me other suggestions, even ones that don’t meet those requirements?

Many thanks!

I’m having a similar issue with Patch Motion Correction, with error message “Job is unresponsive - no heartbeat received in 180 seconds.”

Here is the result of

cryosparcm cli "get_scheduler_targets()"
cryosparcm eventlog $projectid $jobid | head -n 40
cryosparcm joblog $projectid $jobid | tail -n 20
[{'cache_path': '/mnt/Scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 2, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 3, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}], 'hostname': '10.250.102.2', 'lane': 'Amphitrite', 'monitor_port': None, 'name': '10.250.102.2', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparcuser@10.250.102.2', 'title': 'Worker node 10.250.102.2', 'type': 'node', 'worker_bin_path': '/home/cryosparcuser/cryosparc/cryosparc_worker/bin/cryosparcw'}]
[Tue, 24 Sep 2024 00:44:30 GMT]  License is valid.
[Tue, 24 Sep 2024 00:44:30 GMT]  Launching job on lane Amphitrite target 10.250.102.2 ...
[Tue, 24 Sep 2024 00:44:30 GMT]  Running job on remote worker node hostname 10.250.102.2
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Job J203 Started
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Master running v4.5.3, worker running v4.5.3
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Working in directory: /home/cryosparcuser/Chiron/Shelly_wickham/Jiahe/Cryosparc/Cryosparc_projects/CS-2x2cube/J203
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Running on lane Amphitrite
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Resources allocated:
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB]   Worker:  10.250.102.2
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB]   CPU   :  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB]   GPU   :  [0, 1, 2, 3]
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB]   RAM   :  [0, 1, 2, 3, 4, 5, 6, 7]
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB]   SSD   :  False
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] --------------------------------------------------------------
[Tue, 24 Sep 2024 00:44:33 GMT] [CPU RAM used: 80 MB] Importing job module for job type patch_motion_correction_multi...
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] Job ready to run
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] ***************************************************************
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] Job will process this many movies:  971
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] Job will output denoiser training data for this many movies:  200
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] Random seed: 690708223
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 211 MB] parent process is 57387
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 163 MB] Calling CUDA init from 57453
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 163 MB] Calling CUDA init from 57456
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 163 MB] Calling CUDA init from 57454
[Tue, 24 Sep 2024 00:44:38 GMT] [CPU RAM used: 163 MB] Calling CUDA init from 57455
[Tue, 24 Sep 2024 00:44:39 GMT] [CPU RAM used: 294 MB] -- 2.0: processing 4 of 971: J195/imported/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions.mrc
        loading /home/cryosparcuser/Chiron/Shelly_wickham/Jiahe/Cryosparc/Cryosparc_projects/CS-2x2cube/J195/imported/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions.mrc
        Loading raw movie data from J195/imported/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions.mrc ...
        Done in 15.53s
        Processing ...
        Done in 17.77s
        Completed rigid and patch motion with (Z:5,Y:8,X:8) knots
        Writing non-dose-weighted result to J203/motioncorrected/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions_patch_aligned.mrc ...
        Done in 0.18s
        Writing 120x120 micrograph thumbnail to J203/thumbnails/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions_thumb_@1x.png ...
        Done in 0.01s
        Writing 240x240 micrograph thumbnail to J203/thumbnails/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions_thumb_@2x.png ...
        Done in 0.01s
        Writing dose-weighted result to J203/motioncorrected/001556738385385257523_FoilHole_975794_Data_974488_57_20240901_133318_Fractions_patch_aligned_doseweighted.mrc ...
        Done in 0.15s
Traceback (most recent call last):
  File "<string>", line 9, in <module>
BrokenPipeError: [Errno 32] Broken pipe
========= sending heartbeat at 2024-09-24 11:55:54.559373
========= sending heartbeat at 2024-09-24 11:56:04.579781
========= sending heartbeat at 2024-09-24 11:56:14.600115
========= sending heartbeat at 2024-09-24 11:56:24.620202
========= sending heartbeat at 2024-09-24 11:56:34.640542
========= sending heartbeat at 2024-09-24 11:56:44.662026
========= sending heartbeat at 2024-09-24 11:56:54.682136
========= sending heartbeat at 2024-09-24 11:57:04.701806
========= sending heartbeat at 2024-09-24 11:57:14.720815
========= sending heartbeat at 2024-09-24 11:57:24.741521
========= sending heartbeat at 2024-09-24 11:57:34.761958
========= sending heartbeat at 2024-09-24 11:57:44.782397
========= sending heartbeat at 2024-09-24 11:57:54.803461
========= sending heartbeat at 2024-09-24 11:58:04.824165
========= sending heartbeat at 2024-09-24 11:58:14.844469
========= sending heartbeat at 2024-09-24 11:58:24.864399
========= sending heartbeat at 2024-09-24 11:58:34.885077
========= sending heartbeat at 2024-09-24 11:58:44.906332
========= sending heartbeat at 2024-09-24 11:58:54.926720
/home/cryosparcuser/cryosparc/cryosparc_worker/bin/cryosparcw: line 150: 57387 Terminated              python -c "import cryosparc_compute.run as run; run.run()" "$@"

There were no “Out of memory” errors around the time the job failed (checked with sudo journalctl | grep -i oom).
When we try running the job again, sometimes it fails (at a different point) and sometimes it finishes.

Any idea on how we could troubleshoot what’s causing the heartbeat to be lost?

@Andre Thanks for posting this information.
Please can you post the output of the commands

  • on the CryoSPARC master (replacing P99 with the actual project ID)
    cryosparcm cli "get_job('P99', 'J203', 'job_type', 'version', 'instance_information', 'status', 'killed_at', 'started_at', 'failed_at', 'params_spec', 'cloned_from', 'PID_main', 'PID_monitor')"
    
  • on the CryoSPARC worker computer where job J203 failed
    hostname -f
    cat /sys/kernel/mm/transparent_hugepage/enabled 
    cat /proc/cmdline
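
In the meantime, since the failures are intermittent, it may also help to log system state alongside the job and then check what the machine was doing when the heartbeat stopped. A sketch (the sampling interval and log file names are arbitrary):

# Sample GPU memory and utilization every 5 s to a CSV log
nvidia-smi --query-gpu=timestamp,index,memory.used,utilization.gpu \
           --format=csv -l 5 > gpu_log.csv &

# Sample system memory, swap and I/O every 5 s, with timestamps
vmstat -t 5 > vmstat_log.txt &

# After a failure, also check the kernel log for errors near the failure time
dmesg -T | tail -n 50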
    

@wtempel Thanks for following up on this. Here is the result for the CryoSPARC master

[cryosparcuser@amphitrite ~]$ cryosparcm cli "get_job('P105', 'J203', 'job_type', 'version', 'instance_information', 'status', 'killed_at', 'started_at', 'failed_at', 'params_spec', 'cloned_from', 'PID_main', 'PID_monitor')"
{'PID_main': 57387, 'PID_monitor': 57393, '_id': '66f20b502b5694c187be0742', 'cloned_from': None, 'failed_at': 'Tue, 24 Sep 2024 02:02:01 GMT', 'instance_information': {'CUDA_version': '11.8', 'available_memory': '243.45GB', 'cpu_model': 'Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz', 'driver_version': '12.4', 'gpu_info': [{'id': 0, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti', 'pcie': '0000:18:00'}, {'id': 1, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti', 'pcie': '0000:3b:00'}, {'id': 2, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti', 'pcie': '0000:86:00'}, {'id': 3, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti', 'pcie': '0000:af:00'}], 'ofd_hard_limit': 4096, 'ofd_soft_limit': 1024, 'physical_cores': 40, 'platform_architecture': 'x86_64', 'platform_node': 'amphitrite.research.sydney.edu.au', 'platform_release': '3.10.0-1160.53.1.el7.x86_64', 'platform_version': '#1 SMP Fri Jan 14 13:59:45 UTC 2022', 'total_memory': '251.35GB', 'used_memory': '6.94GB'}, 'job_type': 'patch_motion_correction_multi', 'killed_at': 'Tue, 24 Sep 2024 02:02:00 GMT', 'params_spec': {'compute_num_gpus': {'value': 4}}, 'project_uid': 'P105', 'started_at': 'Tue, 24 Sep 2024 00:44:33 GMT', 'status': 'failed', 'uid': 'J203', 'version': 'v4.5.3'}

And here is the result for the worker

[cryosparcuser@amphitrite ~]$ hostname -f
amphitrite.research.sydney.edu.au
[cryosparcuser@amphitrite ~]$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
[cryosparcuser@amphitrite ~]$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-1160.53.1.el7.x86_64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb modprobe.blacklist=nouveau LANG=en_AU.UTF-8 nouveau.modeset=0 rd.driver.blacklist=nouveau

Please can you test if running the commands (details)

sudo sh -c "echo never > /sys/kernel/mm/transparent_hugepage/enabled"
sudo sh -c "echo never > /sys/kernel/mm/transparent_hugepage/defrag"

before running the job resolves the issue?
These commands change the transparent_hugepage settings only temporarily; their effect is lost on a system reboot.
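
If disabling transparent_hugepage turns out to help and you later want the setting to persist across reboots, one option on a CentOS 7 system like this one is a kernel boot parameter. A sketch (paths per CentOS 7 BIOS defaults; use the EFI grub.cfg path on UEFI systems):

# Append transparent_hugepage=never to the kernel command line
sudo sed -i 's/^GRUB_CMDLINE_LINUX="/&transparent_hugepage=never /' /etc/default/grub

# Regenerate the GRUB configuration and reboot for it to take effect
sudo grub2-mkconfig -o /boot/grub2/grub.cfg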

@wtempel Do I run this on the worker or the master?

Also note that this issue is intermittent: when I rerun the job, it can succeed even without this change, so a single successful run would not prove that the change fixed the issue.

For the test, transparent_hugepage would need to be disabled on the worker.

Please let us know when you

  1. encounter the issue again and
  2. confirm that the command (on the applicable worker)
    cat /sys/kernel/mm/transparent_hugepage/enabled
    
    shows
    always madvise [never]
    
    when the issue occurs, that is, that transparent_hugepage has in fact been disabled on the relevant worker and has not been re-enabled (either automatically while the system is running or during system startup).
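
To make that confirmation easier, here is a small sketch that records the transparent_hugepage state once a minute while jobs run, so the setting at the exact failure time can be checked afterwards (the log path is arbitrary):

# Log a timestamp and the current THP setting every 60 s
while true; do
    echo "$(date -Is) $(cat /sys/kernel/mm/transparent_hugepage/enabled)" >> /tmp/thp_state.log
    sleep 60
done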