The homogeneous refinement job process terminated unexpectedly

Hi,
I am experiencing abnormal failures with a homogeneous refinement job and have been unable to determine the cause. For your reference, I am attaching the last few lines from the metadata tab.

The box size I used is 480 px.

Thank you

Please can you post the outputs of these commands:

cryosparcm cli "get_scheduler_targets()"
cryosparcm cli "get_job('P99', 'J199', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run', 'started_at')"

where P99, J199 have been replaced with the actual project and job IDs of the failed job.
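
One way to avoid leaving the placeholder IDs in place, or pasting typographic quotes that a word processor may substitute for straight ones, is to build the command programmatically. The following is a minimal sketch in Python, assuming `cryosparcm` is on the PATH of the user who runs it. The helper names `build_get_job_command` and `run_get_job` are hypothetical and not part of CryoSPARC.

```python
import subprocess

# Fields requested in the get_job() call above.
FIELDS = ["job_type", "version", "instance_information", "status",
          "params_spec", "errors_run", "started_at"]

def build_get_job_command(project_uid, job_uid, fields=FIELDS):
    """Return the argument list for `cryosparcm cli "get_job(...)"` (hypothetical helper)."""
    # repr() of a str yields a straight single-quoted literal such as 'P20'.
    args = ", ".join(repr(a) for a in [project_uid, job_uid, *fields])
    return ["cryosparcm", "cli", f"get_job({args})"]

def run_get_job(project_uid, job_uid):
    """Run the command and return its text output (needs a running CryoSPARC master)."""
    result = subprocess.run(build_get_job_command(project_uid, job_uid),
                            capture_output=True, text=True, check=True)
    return result.stdout

# Example IDs only; substitute the project and job IDs of the failed job.
print(build_get_job_command("P20", "J130")[2])
```

Passing the arguments as a list avoids shell quoting altogether, so the quotes inside the `get_job(...)` expression reach `cryosparcm` unchanged.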

Thank you for your response. As requested, here is the output of the suggested commands:
dawson@BSNO13019-Ubuntu:~$ cryosparcm cli "get_scheduler_targets()"

[{'cache_path': '/home/dawson/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 17056202752, 'name': 'Quadro P5000'}], 'hostname': 'BSNO13019-Ubuntu', 'lane': 'default', 'monitor_port': None, 'name': 'BSNO13019-Ubuntu', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'GPU': [0], 'RAM': [0, 1, 2, 3]}, 'ssh_str': 'dawson@BSNO13019-Ubuntu', 'title': 'Worker node BSNO13019-Ubuntu', 'type': 'node', 'worker_bin_path': '/home/dawson/cryosparc_worker/bin/cryosparcw'}]

dawson@BSNO13019-Ubuntu:~$ cryosparcm cli "get_job('P99', 'J199', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run', 'started_at')"

*** (http://BSNO13019-Ubuntu:39002, code 400) Encountered ServerError from JSONRPC function "get_job" with params ('P99', 'J199', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run', 'started_at'):

ServerError: P99 J199 does not exist.

Traceback (most recent call last):

File "/home/dawson/cryosparc_master/cryosparc_command/commandcommon.py", line 196, in wrapper

res = func(*args, **kwargs)

File "/home/dawson/cryosparc_master/cryosparc_command/command_core/__init__.py", line 6132, in get_job

raise ValueError(f"{project_uid} {job_uid} does not exist.")

ValueError: P99 J199 does not exist.

Please can you re-run the get_job() command that failed with the actual project ID (instead of P99) and the actual job ID (instead of J199) of the failed job that corresponds to the screenshot in this topic’s first post.

Thank you. Here is the output of the command for P20 and J130:

dawson@BSNO13019-Ubuntu:~$ cryosparcm cli "get_job('P20', 'J130', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run', 'started_at')"

{'_id': '67af1bce380f4265cf7b7562', 'errors_run': , 'instance_information': {'CUDA_version': '11.8', 'available_memory': '25.44GB', 'cpu_model': 'Intel(R) Xeon(R) W-2145 CPU @ 3.70GHz', 'driver_version': '12.6', 'gpu_info': [{'id': 0, 'mem': 17056202752, 'name': 'Quadro P5000', 'pcie': '0000:65:00'}], 'ofd_hard_limit': 1048576, 'ofd_soft_limit': 1024, 'physical_cores': 8, 'platform_architecture': 'x86_64', 'platform_node': 'BSNO13019-Ubuntu', 'platform_release': '5.15.0-131-generic', 'platform_version': '#141~20.04.1-Ubuntu SMP Thu Jan 16 18:38:51 UTC 2025', 'total_memory': '31.03GB', 'used_memory': '4.83GB'}, 'job_type': 'homo_abinit', 'params_spec': {'abinit_max_res': {'value': 4}}, 'project_uid': 'P20', 'started_at': 'Fri, 14 Feb 2025 10:33:54 GMT', 'status': 'completed', 'uid': 'J130', 'version': 'v4.6.2'}

Thanks @aasif for posting the output. Are you sure this is the output for the correct job? You mentioned earlier that a homogeneous refinement job failed, but the get_job() output corresponds to a different job type (homo_abinit) with status completed. Please can you

  1. double check the IDs and, if you can identify the failed job, post get_job() output for that failed job
  2. post the output of the command
    sudo journalctl | grep -i oom
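
The double check in step 1 can be sketched as a small comparison against the fields that get_job() returns. This is only an illustration in Python: the function name `describe_mismatch` is hypothetical, and the substring "refine" and the status value "failed" are assumptions about how refinement job types and failed jobs are labelled, not values confirmed in this topic.

```python
def describe_mismatch(job, expected_type_hint="refine", expected_status="failed"):
    """Return reasons why `job` does not look like the failed refinement job.

    `expected_type_hint` and `expected_status` are assumed values (see above).
    """
    reasons = []
    job_type = job.get("job_type", "")
    if expected_type_hint not in job_type:
        reasons.append(
            f"job_type is {job_type!r}, expected it to contain {expected_type_hint!r}"
        )
    if job.get("status") != expected_status:
        reasons.append(
            f"status is {job.get('status')!r}, expected {expected_status!r}"
        )
    return reasons

# Fields taken from the get_job() output posted for P20 J130.
job = {"project_uid": "P20", "uid": "J130",
       "job_type": "homo_abinit", "status": "completed"}

for reason in describe_mismatch(job):
    print(reason)
```

An empty list would mean that the job type and status are at least consistent with a failed refinement job.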

Hi @wtempel,

Yes, I am certain about the job, as it is the most recent Homogeneous Refinement job. The first refinement job had failed, and I have provided the output from the second Homogeneous Refinement job for your review.

Finally, please find attached the output of the suggested command as requested.

dawson@BSNO13019-Ubuntu:~$ sudo journalctl | grep -i oom

[sudo] password for dawson:

Feb 08 18:29:19 BSNO13019-Ubuntu kernel: sshd invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

Feb 08 18:29:19 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10

Feb 08 18:29:23 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name

Feb 08 18:29:23 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=chrome,pid=3958,uid=1000

Feb 08 18:29:23 BSNO13019-Ubuntu kernel: Out of memory: Killed process 3958 (chrome) total-vm:1212242196kB, anon-rss:106932kB, file-rss:0kB, shmem-rss:3892kB, UID:1000 pgtables:2872kB oom_score_adj:300

Feb 08 18:29:23 BSNO13019-Ubuntu kernel: gmain invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

Feb 08 18:29:23 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10

Feb 08 18:29:23 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name

Feb 08 18:29:23 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=chrome,pid=29378,uid=1000

Feb 08 18:29:23 BSNO13019-Ubuntu kernel: Out of memory: Killed process 29378 (chrome) total-vm:1212234672kB, anon-rss:75908kB, file-rss:0kB, shmem-rss:1556kB, UID:1000 pgtables:1464kB oom_score_adj:300

Feb 08 18:29:23 BSNO13019-Ubuntu kernel: rsSync invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

Feb 08 18:29:23 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10

Feb 08 18:29:23 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name

Feb 08 18:29:23 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=chrome,pid=525583,uid=1000

Feb 08 18:29:23 BSNO13019-Ubuntu kernel: Out of memory: Killed process 525583 (chrome) total-vm:1212090580kB, anon-rss:9020kB, file-rss:0kB, shmem-rss:320kB, UID:1000 pgtables:652kB oom_score_adj:300

Feb 08 18:29:23 BSNO13019-Ubuntu kernel: python invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

Feb 08 18:29:23 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10

Feb 08 18:29:23 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name

Feb 08 18:29:23 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=726158,uid=1000

Feb 08 18:29:23 BSNO13019-Ubuntu kernel: Out of memory: Killed process 726158 (python) total-vm:18591712kB, anon-rss:7937680kB, file-rss:0kB, shmem-rss:8kB, UID:1000 pgtables:16856kB oom_score_adj:0

Feb 08 18:30:59 BSNO13019-Ubuntu kernel: rtkit-daemon invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

Feb 08 18:30:59 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10

Feb 08 18:30:59 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name

Feb 08 18:30:59 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=726160,uid=1000

Feb 08 18:30:59 BSNO13019-Ubuntu kernel: Out of memory: Killed process 726160 (python) total-vm:21409764kB, anon-rss:10121160kB, file-rss:0kB, shmem-rss:8kB, UID:1000 pgtables:21224kB oom_score_adj:0

Feb 08 18:33:23 BSNO13019-Ubuntu kernel: chrome invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=200

Feb 08 18:33:23 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10

Feb 08 18:33:23 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name

Feb 08 18:33:23 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=726161,uid=1000

Feb 08 18:33:23 BSNO13019-Ubuntu kernel: Out of memory: Killed process 726161 (python) total-vm:27635684kB, anon-rss:14772444kB, file-rss:0kB, shmem-rss:8kB, UID:1000 pgtables:30708kB oom_score_adj:0

Feb 08 18:40:11 BSNO13019-Ubuntu kernel: chrome invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=200

Feb 08 18:40:11 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10

Feb 08 18:40:11 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name

Feb 08 18:40:11 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=726159,uid=1000

Feb 08 18:40:11 BSNO13019-Ubuntu kernel: Out of memory: Killed process 726159 (python) total-vm:46247908kB, anon-rss:28786264kB, file-rss:0kB, shmem-rss:8kB, UID:1000 pgtables:59264kB oom_score_adj:0

Feb 15 10:12:37 BSNO13019-Ubuntu kernel: python invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

Feb 15 10:12:38 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10

Feb 15 10:12:38 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name

Feb 15 10:12:38 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=1837406,uid=1000

Feb 15 10:12:38 BSNO13019-Ubuntu kernel: Out of memory: Killed process 1837406 (python) total-vm:47401216kB, anon-rss:25844416kB, file-rss:90732kB, shmem-rss:8192kB, UID:1000 pgtables:53476kB oom_score_adj:0

Feb 15 14:56:01 BSNO13019-Ubuntu kernel: chrome invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=300

Feb 15 14:56:01 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10

Feb 15 14:56:01 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name

Feb 15 14:56:01 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=1869767,uid=1000

Feb 15 14:56:01 BSNO13019-Ubuntu kernel: Out of memory: Killed process 1869767 (python) total-vm:47925364kB, anon-rss:25828876kB, file-rss:91408kB, shmem-rss:8192kB, UID:1000 pgtables:53516kB oom_score_adj:0

Feb 15 18:32:37 BSNO13019-Ubuntu kernel: chimerax invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

Feb 15 18:32:37 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10

Feb 15 18:32:37 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name

Feb 15 18:32:37 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=1894702,uid=1000

Feb 15 18:32:37 BSNO13019-Ubuntu kernel: Out of memory: Killed process 1894702 (python) total-vm:48202796kB, anon-rss:25837808kB, file-rss:92092kB, shmem-rss:8192kB, UID:1000 pgtables:53500kB oom_score_adj:0

Feb 15 19:05:03 BSNO13019-Ubuntu kernel: gmain invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

Feb 15 19:05:03 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10

Feb 15 19:05:03 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name

Feb 15 19:05:03 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=1898724,uid=1000

Feb 15 19:05:03 BSNO13019-Ubuntu kernel: Out of memory: Killed process 1898724 (python) total-vm:47677480kB, anon-rss:25847156kB, file-rss:92184kB, shmem-rss:8192kB, UID:1000 pgtables:53700kB oom_score_adj:0

Feb 15 19:27:54 BSNO13019-Ubuntu kernel: python invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

Feb 15 19:27:55 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10

Feb 15 19:27:55 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name

Feb 15 19:27:55 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=1901650,uid=1000

Feb 15 19:27:55 BSNO13019-Ubuntu kernel: Out of memory: Killed process 1901650 (python) total-vm:48201756kB, anon-rss:25848480kB, file-rss:92644kB, shmem-rss:8192kB, UID:1000 pgtables:53668kB oom_score_adj:0

Feb 17 10:25:46 BSNO13019-Ubuntu kernel: node invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

Feb 17 10:25:47 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10

Feb 17 10:25:47 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name

Feb 17 10:25:47 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=2166749,uid=1000

Feb 17 10:25:47 BSNO13019-Ubuntu kernel: Out of memory: Killed process 2166749 (python) total-vm:47397420kB, anon-rss:25833760kB, file-rss:90128kB, shmem-rss:8192kB, UID:1000 pgtables:53484kB oom_score_adj:0

Feb 17 11:13:18 BSNO13019-Ubuntu kernel: python invoked oom-killer: gfp_mask=0x1100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0

Feb 17 11:13:18 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10

Feb 17 11:13:18 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name

Feb 17 11:13:18 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=2172985,uid=1000

Feb 17 11:13:18 BSNO13019-Ubuntu kernel: Out of memory: Killed process 2172985 (python) total-vm:1076214800kB, anon-rss:25301564kB, file-rss:93000kB, shmem-rss:540716kB, UID:1000 pgtables:53496kB oom_score_adj:0

Feb 17 14:15:09 BSNO13019-Ubuntu kernel: python invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

Feb 17 14:15:09 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10

Feb 17 14:15:09 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name

Feb 17 14:15:09 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=2195305,uid=1000

Feb 17 14:15:09 BSNO13019-Ubuntu kernel: Out of memory: Killed process 2195305 (python) total-vm:52448500kB, anon-rss:26989912kB, file-rss:90516kB, shmem-rss:8192kB, UID:1000 pgtables:55888kB oom_score_adj:0

Feb 18 11:30:57 BSNO13019-Ubuntu kernel: chrome invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=200

Feb 18 11:30:57 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10

Feb 18 11:30:57 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name

Feb 18 11:30:57 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=2354053,uid=1000

Feb 18 11:30:57 BSNO13019-Ubuntu kernel: Out of memory: Killed process 2354053 (python) total-vm:49529960kB, anon-rss:26645280kB, file-rss:93404kB, shmem-rss:8192kB, UID:1000 pgtables:54176kB oom_score_adj:0

The journalctl output, in which the kernel's OOM killer repeatedly terminates python processes, suggests that the job shown in the screenshot failed due to insufficient RAM. The 32 GB RAM configuration of BSNO13019-Ubuntu is right at the minimum specification for a CryoSPARC worker, and may be insufficient for some job types or inputs.
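
To make this reading of the log easier to reproduce, the "Out of memory: Killed process" lines can be summarized per process name. The following is a rough sketch in Python, not a CryoSPARC tool: the function name `summarize_oom_kills` is hypothetical, and the two sample lines are copied from the journalctl output in this topic with the bold markers removed.

```python
import re
from collections import defaultdict

# Matches kernel lines of the form:
# "... Out of memory: Killed process <pid> (<name>) total-vm:...kB, anon-rss:<n>kB, ..."
KILL_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r".*?anon-rss:(?P<rss_kb>\d+)kB"
)

def summarize_oom_kills(lines):
    """Group OOM kills by process name; values are anon-rss sizes in GiB."""
    kills = defaultdict(list)
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            # The kernel reports kB (1024 bytes); 1024**2 kB make one GiB.
            kills[match["name"]].append(int(match["rss_kb"]) / 1024**2)
    return dict(kills)

sample = [
    "Feb 08 18:29:23 BSNO13019-Ubuntu kernel: Out of memory: Killed process 3958 (chrome) "
    "total-vm:1212242196kB, anon-rss:106932kB, file-rss:0kB, shmem-rss:3892kB, "
    "UID:1000 pgtables:2872kB oom_score_adj:300",
    "Feb 18 11:30:57 BSNO13019-Ubuntu kernel: Out of memory: Killed process 2354053 (python) "
    "total-vm:49529960kB, anon-rss:26645280kB, file-rss:93404kB, shmem-rss:8192kB, "
    "UID:1000 pgtables:54176kB oom_score_adj:0",
]

for name, sizes in summarize_oom_kills(sample).items():
    print(f"{name}: {len(sizes)} kill(s), largest anon-rss {max(sizes):.1f} GiB")
```

Comparing the largest anon-rss of the killed python processes with the total_memory value reported by get_job() (31.03GB on this workstation) indicates how close a single job came to exhausting the system RAM.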

Hi @wtempel,

Thank you for sharing the update. I have already requested my IT department to upgrade the workstation’s RAM to 512 GB.

I really appreciate your prompt responses and the support you’ve provided. Thank you once again for your help!

Best regards,
Aasif

After the RAM upgrade, you will need to update the worker configuration for the RAM to be recognized by CryoSPARC.

@wtempel I will take care of that once the upgrade is complete.
Thank you