Hi,
I am experiencing abnormal failures with a Homogeneous Refinement job and have been unable to determine the cause. For your reference, I am attaching the last few lines from the metadata tab.
The box size I used is 480 px.
Thank you
Please can you post the outputs of these commands:
cryosparcm cli "get_scheduler_targets()"
cryosparcm cli "get_job('P99', 'J199', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run', 'started_at')"
where P99 and J199 have been replaced with the actual project and job IDs of the failed job.
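One way to avoid pasting the placeholder IDs by mistake is to generate the command string from the real IDs. The following is a minimal sketch, not part of CryoSPARC: the helper name `build_get_job_command`, the ID format check (`P` or `J` followed by digits), and the example IDs `P3` and `J42` are assumptions made for illustration. The field list is copied from the command above.

```python
import re

# Fields requested in the get_job() call above.
FIELDS = (
    "job_type", "version", "instance_information", "status",
    "params_spec", "errors_run", "started_at",
)

def build_get_job_command(project_uid, job_uid, fields=FIELDS):
    """Return the `cryosparcm cli` command line for one job (hypothetical helper)."""
    # Assumed ID format: project IDs look like P<number>, job IDs like J<number>.
    if not re.fullmatch(r"P\d+", project_uid):
        raise ValueError(f"unexpected project ID: {project_uid!r}")
    if not re.fullmatch(r"J\d+", job_uid):
        raise ValueError(f"unexpected job ID: {job_uid!r}")
    args = ", ".join(f"'{a}'" for a in (project_uid, job_uid, *fields))
    return f'cryosparcm cli "get_job({args})"'

# Hypothetical IDs; substitute those of the failed job.
print(build_get_job_command("P3", "J42"))
```

The printed line can then be pasted into a shell on the CryoSPARC master.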
Thank you for your response. As requested, here is the output of the suggested commands:
dawson@BSNO13019-Ubuntu:~$ cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/home/dawson/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 17056202752, 'name': 'Quadro P5000'}], 'hostname': 'BSNO13019-Ubuntu', 'lane': 'default', 'monitor_port': None, 'name': 'BSNO13019-Ubuntu', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'GPU': [0], 'RAM': [0, 1, 2, 3]}, 'ssh_str': 'dawson@BSNO13019-Ubuntu', 'title': 'Worker node BSNO13019-Ubuntu', 'type': 'node', 'worker_bin_path': '/home/dawson/cryosparc_worker/bin/cryosparcw'}]
dawson@BSNO13019-Ubuntu:~$ cryosparcm cli "get_job('P99', 'J199', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run', 'started_at')"
*** (http://BSNO13019-Ubuntu:39002, code 400) Encountered ServerError from JSONRPC function "get_job" with params ('P99', 'J199', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run', 'started_at'):
ServerError: P99 J199 does not exist.
Traceback (most recent call last):
  File "/home/dawson/cryosparc_master/cryosparc_command/commandcommon.py", line 196, in wrapper
    res = func(*args, **kwargs)
  File "/home/dawson/cryosparc_master/cryosparc_command/command_core/__init__.py", line 6132, in get_job
    raise ValueError(f"{project_uid} {job_uid} does not exist.")
ValueError: P99 J199 does not exist.
Please can you re-run the get_job() command that failed with the actual project ID (instead of P99) and the actual job ID (instead of J199) of the failed job that corresponds to the screenshot in this topic's first post.
Thank you. Here is the output of the command for P20 and J130:
dawson@BSNO13019-Ubuntu:~$ cryosparcm cli "get_job('P20', 'J130', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run', 'started_at')"
{'_id': '67af1bce380f4265cf7b7562', 'errors_run': , 'instance_information': {'CUDA_version': '11.8', 'available_memory': '25.44GB', 'cpu_model': 'Intel(R) Xeon(R) W-2145 CPU @ 3.70GHz', 'driver_version': '12.6', 'gpu_info': [{'id': 0, 'mem': 17056202752, 'name': 'Quadro P5000', 'pcie': '0000:65:00'}], 'ofd_hard_limit': 1048576, 'ofd_soft_limit': 1024, 'physical_cores': 8, 'platform_architecture': 'x86_64', 'platform_node': 'BSNO13019-Ubuntu', 'platform_release': '5.15.0-131-generic', 'platform_version': '#141~20.04.1-Ubuntu SMP Thu Jan 16 18:38:51 UTC 2025', 'total_memory': '31.03GB', 'used_memory': '4.83GB'}, 'job_type': 'homo_abinit', 'params_spec': {'abinit_max_res': {'value': 4}}, 'project_uid': 'P20', 'started_at': 'Fri, 14 Feb 2025 10:33:54 GMT', 'status': 'completed', 'uid': 'J130', 'version': 'v4.6.2'}
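Thanks @aasif for posting the output. Are you sure this is the output for the correct job? You mentioned earlier failures with a homogeneous refinement job, but the get_job() output corresponds to a different job type (homo_abinit) with status completed. Please can you post the get_job() output for that failed job, as well as the output of this command:
sudo journalctl | grep -i oom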
Hi @wtempel,
Yes, I am certain about the job, as it is the most recent Homogeneous Refinement job. The first refinement job had failed, and I have provided the output from the second Homogeneous Refinement job for your review.
Finally, please find attached the output of the suggested command as requested.
dawson@BSNO13019-Ubuntu:~$ sudo journalctl | grep -I oom
[sudo] password for dawson:
Feb 08 18:29:19 BSNO13019-Ubuntu kernel: sshd invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Feb 08 18:29:19 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10
Feb 08 18:29:23 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Feb 08 18:29:23 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=chrome,pid=3958,uid=1000
Feb 08 18:29:23 BSNO13019-Ubuntu kernel: Out of memory: Killed process 3958 (chrome) total-vm:1212242196kB, anon-rss:106932kB, file-rss:0kB, shmem-rss:3892kB, UID:1000 pgtables:2872kB oom_score_adj:300
Feb 08 18:29:23 BSNO13019-Ubuntu kernel: gmain invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Feb 08 18:29:23 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10
Feb 08 18:29:23 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Feb 08 18:29:23 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=chrome,pid=29378,uid=1000
Feb 08 18:29:23 BSNO13019-Ubuntu kernel: Out of memory: Killed process 29378 (chrome) total-vm:1212234672kB, anon-rss:75908kB, file-rss:0kB, shmem-rss:1556kB, UID:1000 pgtables:1464kB oom_score_adj:300
Feb 08 18:29:23 BSNO13019-Ubuntu kernel: rsSync invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Feb 08 18:29:23 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10
Feb 08 18:29:23 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Feb 08 18:29:23 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=chrome,pid=525583,uid=1000
Feb 08 18:29:23 BSNO13019-Ubuntu kernel: Out of memory: Killed process 525583 (chrome) total-vm:1212090580kB, anon-rss:9020kB, file-rss:0kB, shmem-rss:320kB, UID:1000 pgtables:652kB oom_score_adj:300
Feb 08 18:29:23 BSNO13019-Ubuntu kernel: python invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Feb 08 18:29:23 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10
Feb 08 18:29:23 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Feb 08 18:29:23 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=726158,uid=1000
Feb 08 18:29:23 BSNO13019-Ubuntu kernel: Out of memory: Killed process 726158 (python) total-vm:18591712kB, anon-rss:7937680kB, file-rss:0kB, shmem-rss:8kB, UID:1000 pgtables:16856kB oom_score_adj:0
Feb 08 18:30:59 BSNO13019-Ubuntu kernel: rtkit-daemon invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Feb 08 18:30:59 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10
Feb 08 18:30:59 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Feb 08 18:30:59 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=726160,uid=1000
Feb 08 18:30:59 BSNO13019-Ubuntu kernel: Out of memory: Killed process 726160 (python) total-vm:21409764kB, anon-rss:10121160kB, file-rss:0kB, shmem-rss:8kB, UID:1000 pgtables:21224kB oom_score_adj:0
Feb 08 18:33:23 BSNO13019-Ubuntu kernel: chrome invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=200
Feb 08 18:33:23 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10
Feb 08 18:33:23 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Feb 08 18:33:23 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=726161,uid=1000
Feb 08 18:33:23 BSNO13019-Ubuntu kernel: Out of memory: Killed process 726161 (python) total-vm:27635684kB, anon-rss:14772444kB, file-rss:0kB, shmem-rss:8kB, UID:1000 pgtables:30708kB oom_score_adj:0
Feb 08 18:40:11 BSNO13019-Ubuntu kernel: chrome invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=200
Feb 08 18:40:11 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10
Feb 08 18:40:11 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Feb 08 18:40:11 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=726159,uid=1000
Feb 08 18:40:11 BSNO13019-Ubuntu kernel: Out of memory: Killed process 726159 (python) total-vm:46247908kB, anon-rss:28786264kB, file-rss:0kB, shmem-rss:8kB, UID:1000 pgtables:59264kB oom_score_adj:0
Feb 15 10:12:37 BSNO13019-Ubuntu kernel: python invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Feb 15 10:12:38 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10
Feb 15 10:12:38 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Feb 15 10:12:38 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=1837406,uid=1000
Feb 15 10:12:38 BSNO13019-Ubuntu kernel: Out of memory: Killed process 1837406 (python) total-vm:47401216kB, anon-rss:25844416kB, file-rss:90732kB, shmem-rss:8192kB, UID:1000 pgtables:53476kB oom_score_adj:0
Feb 15 14:56:01 BSNO13019-Ubuntu kernel: chrome invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=300
Feb 15 14:56:01 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10
Feb 15 14:56:01 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Feb 15 14:56:01 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=1869767,uid=1000
Feb 15 14:56:01 BSNO13019-Ubuntu kernel: Out of memory: Killed process 1869767 (python) total-vm:47925364kB, anon-rss:25828876kB, file-rss:91408kB, shmem-rss:8192kB, UID:1000 pgtables:53516kB oom_score_adj:0
Feb 15 18:32:37 BSNO13019-Ubuntu kernel: chimerax invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Feb 15 18:32:37 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10
Feb 15 18:32:37 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Feb 15 18:32:37 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=1894702,uid=1000
Feb 15 18:32:37 BSNO13019-Ubuntu kernel: Out of memory: Killed process 1894702 (python) total-vm:48202796kB, anon-rss:25837808kB, file-rss:92092kB, shmem-rss:8192kB, UID:1000 pgtables:53500kB oom_score_adj:0
Feb 15 19:05:03 BSNO13019-Ubuntu kernel: gmain invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Feb 15 19:05:03 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10
Feb 15 19:05:03 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Feb 15 19:05:03 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=1898724,uid=1000
Feb 15 19:05:03 BSNO13019-Ubuntu kernel: Out of memory: Killed process 1898724 (python) total-vm:47677480kB, anon-rss:25847156kB, file-rss:92184kB, shmem-rss:8192kB, UID:1000 pgtables:53700kB oom_score_adj:0
Feb 15 19:27:54 BSNO13019-Ubuntu kernel: python invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Feb 15 19:27:55 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10
Feb 15 19:27:55 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Feb 15 19:27:55 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=1901650,uid=1000
Feb 15 19:27:55 BSNO13019-Ubuntu kernel: Out of memory: Killed process 1901650 (python) total-vm:48201756kB, anon-rss:25848480kB, file-rss:92644kB, shmem-rss:8192kB, UID:1000 pgtables:53668kB oom_score_adj:0
Feb 17 10:25:46 BSNO13019-Ubuntu kernel: node invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Feb 17 10:25:47 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10
Feb 17 10:25:47 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Feb 17 10:25:47 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=2166749,uid=1000
Feb 17 10:25:47 BSNO13019-Ubuntu kernel: Out of memory: Killed process 2166749 (python) total-vm:47397420kB, anon-rss:25833760kB, file-rss:90128kB, shmem-rss:8192kB, UID:1000 pgtables:53484kB oom_score_adj:0
Feb 17 11:13:18 BSNO13019-Ubuntu kernel: python invoked oom-killer: gfp_mask=0x1100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
Feb 17 11:13:18 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10
Feb 17 11:13:18 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Feb 17 11:13:18 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=2172985,uid=1000
Feb 17 11:13:18 BSNO13019-Ubuntu kernel: Out of memory: Killed process 2172985 (python) total-vm:1076214800kB, anon-rss:25301564kB, file-rss:93000kB, shmem-rss:540716kB, UID:1000 pgtables:53496kB oom_score_adj:0
Feb 17 14:15:09 BSNO13019-Ubuntu kernel: python invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Feb 17 14:15:09 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10
Feb 17 14:15:09 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Feb 17 14:15:09 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=2195305,uid=1000
Feb 17 14:15:09 BSNO13019-Ubuntu kernel: Out of memory: Killed process 2195305 (python) total-vm:52448500kB, anon-rss:26989912kB, file-rss:90516kB, shmem-rss:8192kB, UID:1000 pgtables:55888kB oom_score_adj:0
Feb 18 11:30:57 BSNO13019-Ubuntu kernel: chrome invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=200
Feb 18 11:30:57 BSNO13019-Ubuntu kernel: oom_kill_process.cold+0xb/0x10
Feb 18 11:30:57 BSNO13019-Ubuntu kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Feb 18 11:30:57 BSNO13019-Ubuntu kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=2354053,uid=1000
Feb 18 11:30:57 BSNO13019-Ubuntu kernel: Out of memory: Killed process 2354053 (python) total-vm:49529960kB, anon-rss:26645280kB, file-rss:93404kB, shmem-rss:8192kB, UID:1000 pgtables:54176kB oom_score_adj:0
The journalctl output, with its repeated "Out of memory: Killed process ... (python)" records, suggests that the job shown in the screenshot failed due to insufficient RAM. The 32 GB RAM configuration of BSNO13019-Ubuntu is right at the minimum specification for a CryoSPARC worker, and may be insufficient for some job types or inputs.
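The kill records in the journal output can be tallied with a short script to see how much memory each killed process held. The following is a minimal sketch, not part of CryoSPARC: it assumes the kernel message format seen in the log above, and the names `KILL_RE` and `summarize_oom_kills` are made up for illustration. The kernel reports sizes in kB (1024-byte units), which the sketch converts to GiB.

```python
import re

# Matches the kernel's "Out of memory: Killed process" records.
KILL_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)

def summarize_oom_kills(lines):
    """Return (pid, process name, anon-rss in GiB) for each OOM kill record."""
    kills = []
    for line in lines:
        m = KILL_RE.search(line)
        if m:
            gib = int(m["anon_rss"]) / (1024 ** 2)  # kB -> GiB
            kills.append((int(m["pid"]), m["name"], round(gib, 2)))
    return kills

# Two records copied from the journal output above.
sample = [
    "Feb 18 11:30:57 BSNO13019-Ubuntu kernel: Out of memory: Killed process 2354053 (python) total-vm:49529960kB, anon-rss:26645280kB, file-rss:93404kB, shmem-rss:8192kB, UID:1000 pgtables:54176kB oom_score_adj:0",
    "Feb 08 18:29:23 BSNO13019-Ubuntu kernel: Out of memory: Killed process 3958 (chrome) total-vm:1212242196kB, anon-rss:106932kB, file-rss:0kB, shmem-rss:3892kB, UID:1000 pgtables:2872kB oom_score_adj:300",
]

for pid, name, gib in summarize_oom_kills(sample):
    print(f"{name} (pid {pid}): {gib} GiB resident when killed")
```

For the Feb 18 record, the killed python process held about 25.4 GiB of anonymous memory, on a machine whose reported total_memory is 31.03GB.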
Hi @wtempel,
Thank you for sharing the update. I have already requested my IT department to upgrade the workstation's RAM to 512 GB.
I really appreciate your prompt responses and the support you've provided. Thank you once again for your help!
Best regards,
Aasif
After the RAM upgrade, you will need to update the worker configuration for the RAM to be recognized by CryoSPARC.
@wtempel I will take care of that once the upgrade is complete.
Thank you