Kill signal sent by CryoSPARC (ID: <Heartbeat Monitor>)

Dear community,
I am experiencing problems with CryoSPARC. It stops some of my runs without any apparent reason and gives the following error message:

**** Kill signal sent by CryoSPARC (ID: ) ****

Job is unresponsive - no heartbeat received in 180 seconds.

Do you know who could help with this problem?

Thank you very much in advance.

Santiago


Welcome to the forum @sanjusare.

You may see this message when the CryoSPARC master service fails to receive expected progress updates from the CryoSPARC job.
To help troubleshoot this problem, please collect the outputs of these commands as text, redact any confidential information, and post them here. In the commands below, replace P99 and J199 with the failed job’s actual project and job IDs, respectively.

  • commands to run on the CryoSPARC master
    cspid="P99"
    csjid="J199"
    cryosparcm eventlog "$cspid" "$csjid" | sed '/Importing job module/q'
    cryosparcm eventlog "$cspid" "$csjid" | tail -n 20
    cryosparcm joblog "$cspid" "$csjid" | sed '/MONITOR PROCESS/q'
    cryosparcm joblog "$cspid" "$csjid" | tail -n 40
    cryosparcm cli "get_job('$cspid', '$csjid', 'job_type', 'version', 'instance_information', 'status', 'killed_at', 'started_at', 'params_spec')"
    
  • commands to run on the relevant CryoSPARC worker
    hostname
    free -h
    cat /sys/kernel/mm/transparent_hugepage/enabled
    

These are the outputs of the commands as requested:

cryosparc_user@cryo:/tmp$ cryosparcm eventlog "$cspid" "$csjid" | sed '/Importing job module/q'
[Fri, 20 Sep 2024 00:01:19 GMT] License is valid.
[Fri, 20 Sep 2024 00:01:19 GMT] Launching job on lane default target cryo …
[Fri, 20 Sep 2024 00:01:19 GMT] Running job on master node hostname cryo
[Fri, 20 Sep 2024 00:01:21 GMT] [CPU RAM used: 91 MB] Job J169 Started
[Fri, 20 Sep 2024 00:01:21 GMT] [CPU RAM used: 91 MB] Master running v4.5.1, worker running v4.5.1
[Fri, 20 Sep 2024 00:01:21 GMT] [CPU RAM used: 91 MB] Working in directory: /vol_dados/scipion/cs_projects/CynD_wt_Cambridge-santiago/CS-cynd-wt-cambridge-santiago/J169
[Fri, 20 Sep 2024 00:01:21 GMT] [CPU RAM used: 91 MB] Running on lane default
[Fri, 20 Sep 2024 00:01:21 GMT] [CPU RAM used: 91 MB] Resources allocated:
[Fri, 20 Sep 2024 00:01:21 GMT] [CPU RAM used: 91 MB] Worker: cryo
[Fri, 20 Sep 2024 00:01:21 GMT] [CPU RAM used: 91 MB] CPU : [0, 1, 2, 3]
[Fri, 20 Sep 2024 00:01:21 GMT] [CPU RAM used: 91 MB] GPU : [0]
[Fri, 20 Sep 2024 00:01:21 GMT] [CPU RAM used: 91 MB] RAM : [0, 1, 2]
[Fri, 20 Sep 2024 00:01:21 GMT] [CPU RAM used: 91 MB] SSD : False
[Fri, 20 Sep 2024 00:01:21 GMT] [CPU RAM used: 91 MB] --------------------------------------------------------------
[Fri, 20 Sep 2024 00:01:21 GMT] [CPU RAM used: 91 MB] Importing job module for job type var_3D…

cryosparc_user@cryo:/tmp$ cryosparcm eventlog "$cspid" "$csjid" | tail -n 20
[Fri, 20 Sep 2024 00:01:32 GMT] [CPU RAM used: 1041 MB] Reading particle stack…
[Fri, 20 Sep 2024 00:01:32 GMT] [CPU RAM used: 1123 MB] Windowing particles
[Fri, 20 Sep 2024 00:01:32 GMT] [CPU RAM used: 1127 MB] Done.
[Fri, 20 Sep 2024 00:01:39 GMT] [CPU RAM used: 1334 MB] Will process 361080 particles
[Fri, 20 Sep 2024 00:01:39 GMT] [CPU RAM used: 1334 MB] Resampling mask to box size 550
[Fri, 20 Sep 2024 00:01:57 GMT] [CPU RAM used: 3371 MB] Starting 3D Variability =====================
[Fri, 20 Sep 2024 00:01:57 GMT] [CPU RAM used: 3371 MB] Initial reconstruction 1 of 2
[Fri, 20 Sep 2024 00:02:00 GMT] [CPU RAM used: 4373 MB] batch 362 of 362
[Fri, 20 Sep 2024 00:24:51 GMT] [CPU RAM used: 8827 MB] Initial reconstruction 2 of 2
[Fri, 20 Sep 2024 00:24:52 GMT] [CPU RAM used: 8827 MB] batch 362 of 362
[Fri, 20 Sep 2024 00:47:48 GMT] [CPU RAM used: 8857 MB] Using a colored noise model
[Fri, 20 Sep 2024 00:47:49 GMT] Noise Model
[Fri, 20 Sep 2024 00:47:49 GMT] [CPU RAM used: 8860 MB] Starting iterations…
[Fri, 20 Sep 2024 00:47:50 GMT] [CPU RAM used: 8881 MB] Using random seed 1067639864
[Fri, 20 Sep 2024 00:47:50 GMT] [CPU RAM used: 8881 MB] Start iteration 0 of 20
[Fri, 20 Sep 2024 00:47:50 GMT] [CPU RAM used: 8881 MB] batch 362 of 362
[Fri, 20 Sep 2024 01:15:17 GMT] [CPU RAM used: 10794 MB] Done. Solving…
[Fri, 20 Sep 2024 01:15:41 GMT] [CPU RAM used: 13737 MB] diagnostic: min-ev 0.2902069664001465
[Fri, 20 Sep 2024 13:17:08 GMT] **** Kill signal sent by CryoSPARC (ID: ) ****
[Fri, 20 Sep 2024 13:17:08 GMT] Job is unresponsive - no heartbeat received in 600 seconds.

cryosparc_user@cryo:/tmp$ cryosparcm joblog "$cspid" "$csjid" | sed '/MONITOR PROCESS/q'

================= CRYOSPARCW ======= 2024-09-19 21:01:20.347920 =========
Project P3 Job J169
Master cryo Port 39002

MAIN PROCESS PID 95535
========= now starting main process at 2024-09-19 21:01:20.349082
var3D.run cryosparc_compute.jobs.jobregister
/home/cryosparc_user/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class ‘numpy.float64’> type is zero.
setattr(self, word, getattr(machar, word).flat[0])
/home/cryosparc_user/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class ‘numpy.float64’> type is zero.
return self._float_to_str(self.smallest_subnormal)
/home/cryosparc_user/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class ‘numpy.float32’> type is zero.
setattr(self, word, getattr(machar, word).flat[0])
/home/cryosparc_user/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class ‘numpy.float32’> type is zero.
return self._float_to_str(self.smallest_subnormal)
MONITOR PROCESS PID 95551

cryosparc_user@cryo:/tmp$ cryosparcm joblog "$cspid" "$csjid" | tail -n 40
========= sending heartbeat at 2024-09-19 22:10:11.565981
========= sending heartbeat at 2024-09-19 22:10:21.588983
========= sending heartbeat at 2024-09-19 22:10:31.612982
========= sending heartbeat at 2024-09-19 22:10:41.661986
========= sending heartbeat at 2024-09-19 22:10:51.686983
========= sending heartbeat at 2024-09-19 22:11:01.712973
========= sending heartbeat at 2024-09-19 22:11:11.737975
========= sending heartbeat at 2024-09-19 22:11:21.761977
========= sending heartbeat at 2024-09-19 22:11:31.785983
========= sending heartbeat at 2024-09-19 22:11:41.809986
========= sending heartbeat at 2024-09-19 22:11:51.833984
========= sending heartbeat at 2024-09-19 22:12:01.860982
========= sending heartbeat at 2024-09-19 22:12:16.702984
========= sending heartbeat at 2024-09-19 22:12:26.954985
========= sending heartbeat at 2024-09-19 22:12:36.978981
========= sending heartbeat at 2024-09-19 22:12:47.000981
========= sending heartbeat at 2024-09-19 22:12:58.268983
========= sending heartbeat at 2024-09-19 22:13:09.036981
========= sending heartbeat at 2024-09-19 22:13:19.058983
========= sending heartbeat at 2024-09-19 22:13:29.079979
========= sending heartbeat at 2024-09-19 22:13:39.102982
========= sending heartbeat at 2024-09-19 22:13:49.125981
========= sending heartbeat at 2024-09-19 22:13:59.148984
========= sending heartbeat at 2024-09-19 22:14:09.173977
========= sending heartbeat at 2024-09-19 22:14:19.196982
========= sending heartbeat at 2024-09-19 22:14:29.219979
========= sending heartbeat at 2024-09-19 22:14:39.267978
========= sending heartbeat at 2024-09-19 22:14:49.290979
========= sending heartbeat at 2024-09-19 22:14:59.311979
========= sending heartbeat at 2024-09-19 22:15:09.333978
========= sending heartbeat at 2024-09-19 22:15:19.356984
========= sending heartbeat at 2024-09-19 22:15:29.380981
========= sending heartbeat at 2024-09-19 22:15:39.403983
========= sending heartbeat at 2024-09-19 22:15:49.425986
:1: DeprecationWarning: np.int is a deprecated alias for the builtin int. To silence this warning, use int by itself. Doing this will not modify any behavior and is safe. When replacing np.int, you may wish to use e.g. np.int64 or np.int32 to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: NumPy 1.20.0 Release Notes — NumPy v2.2.dev0 Manual
========= sending heartbeat at 2024-09-19 22:15:59.450978
========= sending heartbeat at 2024-09-19 22:16:09.474977
========= sending heartbeat at 2024-09-19 22:16:19.496978
========= sending heartbeat at 2024-09-19 22:16:29.518977

cryosparc_user@cryo:/tmp$ cryosparcm cli "get_job('$cspid', '$csjid', 'job_type', 'version', 'instance_information', 'status', 'killed_at', 'started_at', 'params_spec')"
{‘_id’: ‘66ecbb4ae5c35e890b33b3cb’, ‘instance_information’: {‘CUDA_version’: ‘11.8’, ‘available_memory’: ‘241.96GB’, ‘cpu_model’: ‘AMD Ryzen Threadripper PRO 5965WX 24-Cores’, ‘driver_version’: ‘12.2’, ‘gpu_info’: [{‘id’: 0, ‘mem’: 25390678016, ‘name’: ‘NVIDIA GeForce RTX 4090’, ‘pcie’: ‘0000:41:00’}, {‘id’: 1, ‘mem’: 25393692672, ‘name’: ‘NVIDIA GeForce RTX 4090’, ‘pcie’: ‘0000:42:00’}], ‘ofd_hard_limit’: 1048576, ‘ofd_soft_limit’: 1024, ‘physical_cores’: 24, ‘platform_architecture’: ‘x86_64’, ‘platform_node’: ‘cryo’, ‘platform_release’: ‘6.8.0-40-generic’, ‘platform_version’: ‘#40~22.04.3-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 30 17:30:19 UTC 2’, ‘total_memory’: ‘251.53GB’, ‘used_memory’: ‘7.34GB’}, ‘job_type’: ‘var_3D’, ‘killed_at’: ‘Fri, 20 Sep 2024 13:17:08 GMT’, ‘params_spec’: {‘compute_use_ssd’: {‘value’: False}, ‘var_filter_res’: {‘value’: 5.0}}, ‘project_uid’: ‘P3’, ‘started_at’: ‘Fri, 20 Sep 2024 00:01:21 GMT’, ‘status’: ‘failed’, ‘uid’: ‘J169’, ‘version’: ‘v4.5.1’}

I am not yet sure what caused the job to stall, after such a long time.

The job may run significantly faster if particle caching were enabled. Does your computer have a suitable cache device? What are the outputs of these commands on the CryoSPARC master computer:

uname
lsblk
cryosparcm cli "get_scheduler_targets()"
cat /sys/kernel/mm/transparent_hugepage/enabled
sudo journalctl | grep -i oom
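
For reference, if a suitable local SSD/NVMe device is available and you would like to set or update the worker’s particle cache location, a cryosparcw connect call along the following lines can be used. This is only a sketch: the hostnames, port and /path/to/cache are placeholders that must be adapted to your installation, and caching is then switched on per job via the “Cache particle images on SSD” parameter.

    # run as cryosparc_user on the worker node (placeholders: hostnames, base port, cache path)
    cd /home/cryosparc_user/cryosparc/cryosparc_worker
    bin/cryosparcw connect --worker cryo --master cryo --port 39000 \
        --ssdpath /path/to/cache --update   # adjust --port if your master uses a non-default base port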

cryosparc_user@cryo:/home/santiago/Desktop$ uname
Linux

cryosparc_user@cryo:/home/santiago/Desktop$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
loop0 7:0 0 55,7M 1 loop /snap/core18/2823
loop1 7:1 0 4K 1 loop /snap/bare/5
loop2 7:2 0 55,7M 1 loop /snap/core18/2829
loop3 7:3 0 63,9M 1 loop /snap/core20/2318
loop4 7:4 0 64M 1 loop /snap/core20/2379
loop5 7:5 0 74,3M 1 loop /snap/core22/1612
loop6 7:6 0 74,2M 1 loop /snap/core22/1621
loop7 7:7 0 271,2M 1 loop /snap/firefox/4848
loop8 7:8 0 271,4M 1 loop /snap/firefox/4955
loop9 7:9 0 346,3M 1 loop /snap/gnome-3-38-2004/119
loop10 7:10 0 349,7M 1 loop /snap/gnome-3-38-2004/143
loop11 7:11 0 504,2M 1 loop /snap/gnome-42-2204/172
loop12 7:12 0 505,1M 1 loop /snap/gnome-42-2204/176
loop13 7:13 0 91,7M 1 loop /snap/gtk-common-themes/1535
loop14 7:14 0 12,9M 1 loop /snap/snap-store/1113
loop15 7:15 0 12,2M 1 loop /snap/snap-store/1216
loop16 7:16 0 38,7M 1 loop /snap/snapd/21465
loop17 7:17 0 38,8M 1 loop /snap/snapd/21759
loop18 7:18 0 476K 1 loop /snap/snapd-desktop-integration/157
loop19 7:19 0 500K 1 loop /snap/snapd-desktop-integration/178
sda 8:0 0 81,9T 0 disk
└─sda1 8:1 0 81,9T 0 part /vol_dados
nvme0n1 259:0 0 931,5G 0 disk
├─nvme0n1p1 259:1 0 512M 0 part /boot/efi
└─nvme0n1p2 259:2 0 931G 0 part /var/snap/firefox/common/host-hunspell
/
nvme1n1 259:3 0 1,8T 0 disk
├─nvme1n1p1 259:4 0 512M 0 part
└─nvme1n1p2 259:5 0 1,8T 0 part /mnt/swap

cryosparc_user@cryo:/home/santiago/Desktop$ cryosparcm cli "get_scheduler_targets()"
[{‘cache_path’: ‘/mnt/swap’, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘desc’: None, ‘gpus’: [{‘id’: 0, ‘mem’: 25390678016, ‘name’: ‘NVIDIA GeForce RTX 4090’}, {‘id’: 1, ‘mem’: 25393692672, ‘name’: ‘NVIDIA GeForce RTX 4090’}], ‘hostname’: ‘cryo’, ‘lane’: ‘default’, ‘monitor_port’: None, ‘name’: ‘cryo’, ‘resource_fixed’: {‘SSD’: True}, ‘resource_slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], ‘GPU’: [0, 1], ‘RAM’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, ‘ssh_str’: ‘cryosparc_user@cryo’, ‘title’: ‘Worker node cryo’, ‘type’: ‘node’, ‘worker_bin_path’: ‘/home/cryosparc_user/cryosparc/cryosparc_worker/bin/cryosparcw’}]

cryosparc_user@cryo:/home/santiago/Desktop$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

(base) santiago@cryo:~/Desktop$ sudo journalctl | grep -i oom
set 24 19:12:22 cryo systemd-oomd[1018]: Killed /user.slice/user-1001.slice/user@1001.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-d7944da3-8853-41df-a37e-15f5af46f488.scope due to memory pressure for /user.slice/user-1001.slice/user@1001.service being 56.97% > 50.00% for > 20s with reclaim activity
set 24 19:12:22 cryo systemd[12986]: vte-spawn-d7944da3-8853-41df-a37e-15f5af46f488.scope: systemd-oomd killed 22 process(es) in this unit.
set 24 19:12:38 cryo systemd-oomd[1018]: Killed /user.slice/user-1001.slice/user@1001.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-53dd88fd-7d6d-44a7-960c-e19af359932f.scope due to memory pressure for /user.slice/user-1001.slice/user@1001.service being 70.79% > 50.00% for > 20s with reclaim activity
set 24 19:12:38 cryo systemd[12986]: vte-spawn-53dd88fd-7d6d-44a7-960c-e19af359932f.scope: systemd-oomd killed 20 process(es) in this unit.

The last command gives a very large output; the first lines are shown above.


Interesting. Please can you also post the output of the command

id cryosparc_user

Is there a specific reason for disabling particle caching for the failed job?

cryosparc_user@cryo:/home/santiago/Desktop$ id cryosparc_user
uid=1002(cryosparc_user) gid=1002(cryosparc_user) grupos=1002(cryosparc_user),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),122(lpadmin),135(lxd),136(sambashare)

Is there a specific reason for disabling particle caching for the failed job?

No, there is not. It is just that I normally do not use particle caching.

Thanks in advance for your help,

Santiago

I wonder whether the job may have stalled due to

  • high overall usage of system RAM and subsequent “swapping”. What is the output of the command
    free -h ?
  • or the transparent_hugepage (THP) setting. Does the job also fail if you disable THP by running the command
    sudo sh -c "echo never>/sys/kernel/mm/transparent_hugepage/enabled"
    
    The command's effect would be lost after the next reboot, but the setting can be made persistent (see the sketch below).
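
For reference, one common way to make the THP setting persist across reboots on Ubuntu-style systems is via the kernel command line; this is a generic Linux sketch, not a CryoSPARC-specific procedure, so please review it before applying:

    # 1. Edit /etc/default/grub and append transparent_hugepage=never to GRUB_CMDLINE_LINUX_DEFAULT, e.g.
    #    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash transparent_hugepage=never"
    # 2. Regenerate the GRUB configuration and reboot:
    sudo update-grub
    sudo reboot
    # 3. After the reboot, confirm the setting took effect:
    cat /sys/kernel/mm/transparent_hugepage/enabled   # expected output: always madvise [never]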

(base) santiago@cryo:~/Desktop$ free -h
               total        used        free      shared  buff/cache   available
Mem:           251Gi        20Gi        58Gi        10Gi       172Gi       219Gi
Swap:             0B          0B          0B

I will try that command. If I still get the error, I will let you know.

Thank you

I executed the suggested command and ran two 3D Variability Analysis jobs. One finished normally and the other failed with the same error.

[CPU: 12.82 GB Avail: 236.38 GB]

Reconstructing from 57020 images…
[CPU: 12.82 GB Avail: 236.38 GB]

batch 26 of 58

**** Kill signal sent by CryoSPARC (ID: ) ****

Dear wtempel,

I still have the same error, now with a 3Dflex training job.

Regards,

Santiago

Dear wtempel

Now I am having this error when running a 3D Flex reconstruction:
====== Job process terminated abnormally.

Any idea how to solve it please?

Santiago

The job may have exhausted the system RAM (“OOM”). When you encounter these errors, please post

  1. errors with context: include a few lines above and below the errors, as applicable
  2. relevant portions of both the event and job logs
  3. potential OOM-related entries from the system log
    csprojectid=P99 # replace with actual CryoSPARC project ID
    csjobid=J199 # replace with actual CryoSPARC job ID
    cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'instance_information', 'status', 'killed_at', 'started_at', 'failed_at', 'params_spec', 'cloned_from', 'PID_main', 'PID_monitor', 'errors_run')"
    cryosparcm joblog "$csprojectid" "$csjobid" | tail -n 20
    cryosparcm eventlog "$csprojectid" "$csjobid" | tail -n 20
    sudo journalctl | grep -i oom | tail -n 20
    
  4. details of possible additional workloads on the computer at the time of the job failure
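
If exhaustion of system RAM is suspected, it can also help to record memory usage while the job runs, so the failure time can be matched against RAM consumption. Below is a minimal sketch using standard tools; the log file path is arbitrary.

    # log overall memory usage every 30 seconds while the job is running; stop with Ctrl-C
    while true; do
        date >> /tmp/mem_during_job.log
        free -h >> /tmp/mem_during_job.log
        sleep 30
    done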

Apologies for not sending the correct information before; here it is:

cryosparc_user@cryo:/home$ cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'instance_information', 'status', 'killed_at', 'started_at', 'failed_at', 'params_spec', 'cloned_from', 'PID_main', 'PID_monitor', 'errors_run')"
{‘PID_main’: 1565612, ‘PID_monitor’: 1565626, ‘_id’: ‘6709516bebc6fde268170b60’, ‘cloned_from’: None, ‘errors_run’: [{‘message’: ‘Job process terminated abnormally.’, ‘warning’: False}], ‘failed_at’: ‘Fri, 11 Oct 2024 16:33:01 GMT’, ‘instance_information’: {‘CUDA_version’: ‘11.8’, ‘available_memory’: ‘241.91GB’, ‘cpu_model’: ‘AMD Ryzen Threadripper PRO 5965WX 24-Cores’, ‘driver_version’: ‘12.2’, ‘gpu_info’: [{‘id’: 0, ‘mem’: 25390678016, ‘name’: ‘NVIDIA GeForce RTX 4090’, ‘pcie’: ‘0000:41:00’}, {‘id’: 1, ‘mem’: 25393692672, ‘name’: ‘NVIDIA GeForce RTX 4090’, ‘pcie’: ‘0000:42:00’}], ‘ofd_hard_limit’: 1048576, ‘ofd_soft_limit’: 1024, ‘physical_cores’: 24, ‘platform_architecture’: ‘x86_64’, ‘platform_node’: ‘cryo’, ‘platform_release’: ‘6.8.0-45-generic’, ‘platform_version’: ‘#45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Sep 11 15:25:05 UTC 2’, ‘total_memory’: ‘251.53GB’, ‘used_memory’: ‘7.37GB’}, ‘job_type’: ‘flex_highres’, ‘killed_at’: None, ‘params_spec’: {‘refine_gs_resplit’: {‘value’: True}}, ‘project_uid’: ‘P1’, ‘started_at’: ‘Fri, 11 Oct 2024 16:25:19 GMT’, ‘status’: ‘failed’, ‘uid’: ‘J226’, ‘version’: ‘v4.6.0’}

cryosparc_user@cryo:/home$ cryosparcm joblog "$csprojectid" "$csjobid" | tail -n 20
python(PyRun_StringFlags+0x7d)[0x654f159eb91d]
python(PyRun_SimpleStringFlags+0x3c)[0x654f159eb75c]
python(Py_RunMain+0x26b)[0x654f159ea66b]
python(Py_BytesMain+0x37)[0x654f159bb1f7]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x700249a29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x700249a29e40]
python(+0x1cb0f1)[0x654f159bb0f1]
rax 0000000000000001 rbx 00006ff8badcefb0 rcx 00000000056e7508 rdx 0000000000000000
rsi 00000000056e7508 rdi 00000000056e7507 rbp 00006ff4e9200010 rsp 00007ffcb9461a60
r8 00000000056e7508 r9 00006ff168ab90b0 r10 00000000056e7508 r11 0000000000000001
r12 00006ff52a600010 r13 00006ff6dce49290 r14 fffffffffa918af7 r15 00006ff168ab90b8
0f af d6 66 0f 28 d1 4c 01 f2 0f 1f 00 48 63 7c 8d 00 f2 0f 10 04 cb 48 ff c1 48 01 d7
f2 41 0f 10 74 fd 00 f2 0f 59 f0 f2 41 0f 59 04 fc f2 0f 58 d6 f2 0f 58 c8 4c 39 c1 75
d2 f2 0f 59 cb 99
→ f2 41 0f 11 11 f7 3c 24 f2 43 0f 11 0c d9 49 83 c1 08 8d 42 01 4d 39 f9 75 93 8b
b4 24 a8 00 00 00 4c 8b 74 24 70 44 8b 14 24 8d 04 36 4c 8b 4c 24 08 4c 89 f1 48 8d 94
24 b0 00 00 00 4c 8d bc

========= main process now complete at 2024-10-11 13:33:01.326628.
========= monitor process now complete at 2024-10-11 13:33:01.376494.

cryosparc_user@cryo:/home$ cryosparcm eventlog "$csprojectid" "$csjobid" | tail -n 20
│ Acquired / Required │ 37.72 GiB / 0.00 B │
└─────────────────────┴───────────────────────┘
Progress: [▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇] 10/10 (100%)
Elapsed: 0h 00m 00s
Active jobs: P1-J226
SSD cache complete for 10 file(s)
──────────────────────────────────────────────────────────────
[Fri, 11 Oct 2024 16:25:51 GMT] [CPU RAM used: 5141 MB] Done.
[Fri, 11 Oct 2024 16:25:51 GMT] [CPU RAM used: 5141 MB] Preparing all particle CTF data…
[Fri, 11 Oct 2024 16:25:51 GMT] [CPU RAM used: 5141 MB] Preparing Gold Standard Split…
[Fri, 11 Oct 2024 16:25:51 GMT] [CPU RAM used: 5141 MB] Particles will be split from scratch into two halves.
[Fri, 11 Oct 2024 16:25:51 GMT] [CPU RAM used: 5141 MB] Split A contains 25000 particles
[Fri, 11 Oct 2024 16:25:51 GMT] [CPU RAM used: 5141 MB] Split B contains 25000 particles
[Fri, 11 Oct 2024 16:25:51 GMT] [CPU RAM used: 5141 MB] Setting up particle poses…
[Fri, 11 Oct 2024 16:25:51 GMT] [CPU RAM used: 5144 MB] ====== High resolution flexible refinement =======
[Fri, 11 Oct 2024 16:25:51 GMT] [CPU RAM used: 5144 MB] Max num L-BFGS iterations was set to 20
[Fri, 11 Oct 2024 16:25:51 GMT] [CPU RAM used: 5144 MB] Starting L-BFGS.
[Fri, 11 Oct 2024 16:25:51 GMT] [CPU RAM used: 5144 MB] Reconstructing half-map A
[Fri, 11 Oct 2024 16:25:51 GMT] [CPU RAM used: 5144 MB] Iteration 0 : 24000 / 25000 particles
[Fri, 11 Oct 2024 16:33:01 GMT] [CPU RAM used: 171 MB] ====== Job process terminated abnormally.

cryosparc_user@cryo:/home$ journalctl | grep -i oom | tail -n 20
set 30 18:34:13 cryo systemd-oomd[1081]: Swap is currently not detected; memory pressure usage will be degraded
set 30 18:34:13 cryo systemd[1]: Started Userspace Out-Of-Memory (OOM) Killer.
out 01 11:42:04 cryo systemd[1]: Starting Userspace Out-Of-Memory (OOM) Killer…
out 01 11:42:04 cryo systemd-oomd[1068]: Swap is currently not detected; memory pressure usage will be degraded
out 01 11:42:04 cryo systemd[1]: Started Userspace Out-Of-Memory (OOM) Killer.
out 01 14:13:04 cryo sshd[6460]: Invalid user joomla from 192.250.224.76 port 48906
out 01 14:13:06 cryo sshd[6460]: Failed password for invalid user joomla from 192.250.224.76 port 48906 ssh2
out 01 14:13:07 cryo sshd[6460]: Connection closed by invalid user joomla 192.250.224.76 port 48906 [preauth]
out 01 21:06:02 cryo systemd-oomd[1068]: Killed /user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-de51ef1a-52de-4675-bfac-e5c1688cc466.scope due to memory pressure for /user.slice/user-1000.slice/user@1000.service being 58.53% > 50.00% for > 20s with reclaim activity
out 01 21:06:02 cryo systemd[2598]: vte-spawn-de51ef1a-52de-4675-bfac-e5c1688cc466.scope: systemd-oomd killed 168 process(es) in this unit.
out 02 12:50:45 cryo systemd-oomd[1068]: Killed /user.slice/user-1001.slice/user@1001.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-f33da2d2-9f89-4043-a810-4419740cbf2c.scope due to memory pressure for /user.slice/user-1001.slice/user@1001.service being 51.70% > 50.00% for > 20s with reclaim activity
out 02 12:50:45 cryo systemd[8419]: vte-spawn-f33da2d2-9f89-4043-a810-4419740cbf2c.scope: systemd-oomd killed 158 process(es) in this unit.
out 03 10:40:23 cryo systemd-oomd[1068]: Killed /user.slice/user-1001.slice/user@1001.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-f564d8f1-ef35-4412-9065-2971526e3e8c.scope due to memory pressure for /user.slice/user-1001.slice/user@1001.service being 76.70% > 50.00% for > 20s with reclaim activity
out 03 10:40:23 cryo systemd[8419]: vte-spawn-f564d8f1-ef35-4412-9065-2971526e3e8c.scope: systemd-oomd killed 22 process(es) in this unit.
out 07 09:21:48 cryo systemd[1]: Starting Userspace Out-Of-Memory (OOM) Killer…
out 07 09:21:48 cryo systemd-oomd[1088]: Swap is currently not detected; memory pressure usage will be degraded
out 07 09:21:48 cryo systemd[1]: Started Userspace Out-Of-Memory (OOM) Killer.
out 09 22:54:32 cryo sshd[1119555]: Invalid user soomi from 103.76.120.61 port 44712
out 09 22:54:33 cryo sshd[1119555]: Failed password for invalid user soomi from 103.76.120.61 port 44712 ssh2
out 09 22:54:34 cryo sshd[1119555]: Disconnected from invalid user soomi 103.76.120.61 port 44712 [preauth]

Thanks for posting these logs. The job log seems to end with a longer error trace. Please can you show more lines by running the command:

cryosparcm joblog P1 J226 | tail -n 50

cryosparc_user@cryo:/home$ cryosparcm joblog P1 J226 | tail -n 50
========= sending heartbeat at 2024-10-11 13:32:01.168346
========= sending heartbeat at 2024-10-11 13:32:11.206084
========= sending heartbeat at 2024-10-11 13:32:21.240678
========= sending heartbeat at 2024-10-11 13:32:31.264419
========= sending heartbeat at 2024-10-11 13:32:41.277467
========= sending heartbeat at 2024-10-11 13:32:51.301239
Received SIGSEGV (addr=00006ff168ab90b0)
/home/cryosparc_user/cryosparc/cryosparc_worker/cryosparc_compute/ioengine/core.so(traceback_signal_handler+0x113)[0x700240a1b953]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x700249a42520]
/home/cryosparc_user/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/scipy/optimize/_lbfgsb.cpython-310-x86_64-linux-gnu.so(+0x975b)[0x70024061a75b]
/home/cryosparc_user/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/scipy/optimize/_lbfgsb.cpython-310-x86_64-linux-gnu.so(+0xf822)[0x700240620822]
/home/cryosparc_user/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/scipy/optimize/_lbfgsb.cpython-310-x86_64-linux-gnu.so(+0x10a7f)[0x700240621a7f]
/home/cryosparc_user/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/scipy/optimize/_lbfgsb.cpython-310-x86_64-linux-gnu.so(+0x4794)[0x700240615794]
python(_PyObject_MakeTpCall+0x26b)[0x654f1592da6b]
python(_PyEval_EvalFrameDefault+0x54a6)[0x654f159299d6]
python(_PyFunction_Vectorcall+0x6c)[0x654f15934a2c]
python(PyObject_Call+0xbc)[0x654f15940f1c]
python(_PyEval_EvalFrameDefault+0x2d83)[0x654f159272b3]
python(_PyFunction_Vectorcall+0x6c)[0x654f15934a2c]
python(PyVectorcall_Call+0xc5)[0x654f15941295]
/home/cryosparc_user/cryosparc/cryosparc_worker/cryosparc_compute/jobs/flex_refine/flexmod.cpython-310-x86_64-linux-gnu.so(+0x94ad0)[0x700236948ad0]
python(_PyEval_EvalFrameDefault+0x13ca)[0x654f159258fa]
python(_PyFunction_Vectorcall+0x6c)[0x654f15934a2c]
/home/cryosparc_user/cryosparc/cryosparc_worker/cryosparc_compute/run.cpython-310-x86_64-linux-gnu.so(+0x20e91)[0x70024a1cde91]
/home/cryosparc_user/cryosparc/cryosparc_worker/cryosparc_compute/run.cpython-310-x86_64-linux-gnu.so(+0x12c31)[0x70024a1bfc31]
python(_PyEval_EvalFrameDefault+0x4c12)[0x654f15929142]
python(+0x1d7c60)[0x654f159c7c60]
python(PyEval_EvalCode+0x87)[0x654f159c7ba7]
python(+0x20812a)[0x654f159f812a]
python(+0x203523)[0x654f159f3523]
python(PyRun_StringFlags+0x7d)[0x654f159eb91d]
python(PyRun_SimpleStringFlags+0x3c)[0x654f159eb75c]
python(Py_RunMain+0x26b)[0x654f159ea66b]
python(Py_BytesMain+0x37)[0x654f159bb1f7]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x700249a29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x700249a29e40]
python(+0x1cb0f1)[0x654f159bb0f1]
rax 0000000000000001 rbx 00006ff8badcefb0 rcx 00000000056e7508 rdx 0000000000000000
rsi 00000000056e7508 rdi 00000000056e7507 rbp 00006ff4e9200010 rsp 00007ffcb9461a60
r8 00000000056e7508 r9 00006ff168ab90b0 r10 00000000056e7508 r11 0000000000000001
r12 00006ff52a600010 r13 00006ff6dce49290 r14 fffffffffa918af7 r15 00006ff168ab90b8
0f af d6 66 0f 28 d1 4c 01 f2 0f 1f 00 48 63 7c 8d 00 f2 0f 10 04 cb 48 ff c1 48 01 d7
f2 41 0f 10 74 fd 00 f2 0f 59 f0 f2 41 0f 59 04 fc f2 0f 58 d6 f2 0f 58 c8 4c 39 c1 75
d2 f2 0f 59 cb 99
→ f2 41 0f 11 11 f7 3c 24 f2 43 0f 11 0c d9 49 83 c1 08 8d 42 01 4d 39 f9 75 93 8b
b4 24 a8 00 00 00 4c 8b 74 24 70 44 8b 14 24 8d 04 36 4c 8b 4c 24 08 4c 89 f1 48 8d 94
24 b0 00 00 00 4c 8d bc

========= main process now complete at 2024-10-11 13:33:01.326628.
========= monitor process now complete at 2024-10-11 13:33:01.376494.

Dear wtempel,

Any idea of how I can solve this problem? I tried to run another 3D Flex reconstruction job with another dataset, and this was the output error:

cryosparc_user@cryo:/home/santiago$ cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'instance_information', 'status', 'killed_at', 'started_at', 'failed_at', 'params_spec', 'cloned_from', 'PID_main', 'PID_monitor', 'errors_run')"
{‘PID_main’: 3417680, ‘PID_monitor’: 3417682, ‘_id’: ‘670e8da1ebc6fde268568a35’, ‘cloned_from’: None, ‘errors_run’: [{‘message’: ‘CUDA error: invalid configuration argument\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n’, ‘warning’: False}], ‘failed_at’: ‘Tue, 15 Oct 2024 15:48:02 GMT’, ‘instance_information’: {‘CUDA_version’: ‘11.8’, ‘available_memory’: ‘240.07GB’, ‘cpu_model’: ‘AMD Ryzen Threadripper PRO 5965WX 24-Cores’, ‘driver_version’: ‘12.2’, ‘gpu_info’: [{‘id’: 0, ‘mem’: 25390678016, ‘name’: ‘NVIDIA GeForce RTX 4090’, ‘pcie’: ‘0000:41:00’}, {‘id’: 1, ‘mem’: 25393692672, ‘name’: ‘NVIDIA GeForce RTX 4090’, ‘pcie’: ‘0000:42:00’}], ‘ofd_hard_limit’: 1048576, ‘ofd_soft_limit’: 1024, ‘physical_cores’: 24, ‘platform_architecture’: ‘x86_64’, ‘platform_node’: ‘cryo’, ‘platform_release’: ‘6.8.0-45-generic’, ‘platform_version’: ‘#45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Sep 11 15:25:05 UTC 2’, ‘total_memory’: ‘251.53GB’, ‘used_memory’: ‘9.21GB’}, ‘job_type’: ‘flex_highres’, ‘killed_at’: None, ‘params_spec’: {‘refine_gs_resplit’: {‘value’: True}}, ‘project_uid’: ‘P3’, ‘started_at’: ‘Tue, 15 Oct 2024 15:43:55 GMT’, ‘status’: ‘failed’, ‘uid’: ‘J308’, ‘version’: ‘v4.6.0’}

cryosparc_user@cryo:/home/santiago$ cryosparcm joblog "$csprojectid" "$csjobid" | tail -n 20
File “/home/cryosparc_user/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/scipy/optimize/_differentiable_functions.py”, line 145, in fun_wrapped
fx = fun(np.copy(x), *args)
File “/home/cryosparc_user/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/scipy/optimize/_optimize.py”, line 78, in call
self._compute_if_needed(x, *args)
File “/home/cryosparc_user/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/scipy/optimize/_optimize.py”, line 72, in _compute_if_needed
fg = self.fun(x, *args)
File “cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py”, line 1638, in cryosparc_master.cryosparc_compute.jobs.flex_refine.flexmod.do_hr_refinement_flex.lambda7
File “cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py”, line 1619, in cryosparc_master.cryosparc_compute.jobs.flex_refine.flexmod.errfunc_flex
File “/home/cryosparc_user/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/torch/_tensor.py”, line 492, in backward
torch.autograd.backward(
File “/home/cryosparc_user/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/torch/autograd/init.py”, line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

set status to failed
========= main process now complete at 2024-10-15 12:48:11.484356.
========= monitor process now complete at 2024-10-15 12:48:11.500277.

cryosparc_user@cryo:/home/santiago$ cryosparcm eventlog "$csprojectid" "$csjobid" | tail -n 20
File “/home/cryosparc_user/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/scipy/optimize/_differentiable_functions.py”, line 262, in _update_fun
self._update_fun_impl()
File “/home/cryosparc_user/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/scipy/optimize/_differentiable_functions.py”, line 163, in update_fun
self.f = fun_wrapped(self.x)
File “/home/cryosparc_user/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/scipy/optimize/_differentiable_functions.py”, line 145, in fun_wrapped
fx = fun(np.copy(x), *args)
File “/home/cryosparc_user/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/scipy/optimize/_optimize.py”, line 78, in call
self._compute_if_needed(x, *args)
File “/home/cryosparc_user/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/scipy/optimize/_optimize.py”, line 72, in _compute_if_needed
fg = self.fun(x, *args)
File “cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py”, line 1638, in cryosparc_master.cryosparc_compute.jobs.flex_refine.flexmod.do_hr_refinement_flex.lambda7
File “cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py”, line 1619, in cryosparc_master.cryosparc_compute.jobs.flex_refine.flexmod.errfunc_flex
File “/home/cryosparc_user/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/torch/_tensor.py”, line 492, in backward
torch.autograd.backward(
File “/home/cryosparc_user/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/torch/autograd/init.py”, line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

(base) santiago@cryo:~$ sudo journalctl | grep -i oom | tail -n 20
set 30 18:34:13 cryo systemd-oomd[1081]: Swap is currently not detected; memory pressure usage will be degraded
set 30 18:34:13 cryo systemd[1]: Started Userspace Out-Of-Memory (OOM) Killer.
out 01 11:42:04 cryo systemd[1]: Starting Userspace Out-Of-Memory (OOM) Killer…
out 01 11:42:04 cryo systemd-oomd[1068]: Swap is currently not detected; memory pressure usage will be degraded
out 01 11:42:04 cryo systemd[1]: Started Userspace Out-Of-Memory (OOM) Killer.
out 01 14:13:04 cryo sshd[6460]: Invalid user joomla from 192.250.224.76 port 48906
out 01 14:13:06 cryo sshd[6460]: Failed password for invalid user joomla from 192.250.224.76 port 48906 ssh2
out 01 14:13:07 cryo sshd[6460]: Connection closed by invalid user joomla 192.250.224.76 port 48906 [preauth]
out 01 21:06:02 cryo systemd-oomd[1068]: Killed /user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-de51ef1a-52de-4675-bfac-e5c1688cc466.scope due to memory pressure for /user.slice/user-1000.slice/user@1000.service being 58.53% > 50.00% for > 20s with reclaim activity
out 01 21:06:02 cryo systemd[2598]: vte-spawn-de51ef1a-52de-4675-bfac-e5c1688cc466.scope: systemd-oomd killed 168 process(es) in this unit.
out 02 12:50:45 cryo systemd-oomd[1068]: Killed /user.slice/user-1001.slice/user@1001.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-f33da2d2-9f89-4043-a810-4419740cbf2c.scope due to memory pressure for /user.slice/user-1001.slice/user@1001.service being 51.70% > 50.00% for > 20s with reclaim activity
out 02 12:50:45 cryo systemd[8419]: vte-spawn-f33da2d2-9f89-4043-a810-4419740cbf2c.scope: systemd-oomd killed 158 process(es) in this unit.
out 03 10:40:23 cryo systemd-oomd[1068]: Killed /user.slice/user-1001.slice/user@1001.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-f564d8f1-ef35-4412-9065-2971526e3e8c.scope due to memory pressure for /user.slice/user-1001.slice/user@1001.service being 76.70% > 50.00% for > 20s with reclaim activity
out 03 10:40:23 cryo systemd[8419]: vte-spawn-f564d8f1-ef35-4412-9065-2971526e3e8c.scope: systemd-oomd killed 22 process(es) in this unit.
out 07 09:21:48 cryo systemd[1]: Starting Userspace Out-Of-Memory (OOM) Killer…
out 07 09:21:48 cryo systemd-oomd[1088]: Swap is currently not detected; memory pressure usage will be degraded
out 07 09:21:48 cryo systemd[1]: Started Userspace Out-Of-Memory (OOM) Killer.
out 09 22:54:32 cryo sshd[1119555]: Invalid user soomi from 103.76.120.61 port 44712
out 09 22:54:33 cryo sshd[1119555]: Failed password for invalid user soomi from 103.76.120.61 port 44712 ssh2
out 09 22:54:34 cryo sshd[1119555]: Disconnected from invalid user soomi 103.76.120.61 port 44712 [preauth]

Thank you very much in advance,

Santiago

We unfortunately do not have a solution for this problem.

Please can you post the output of the command
nvidia-smi
on the computer named cryo?