Stuck in queue forever

After updating to version 4.4.1, I ran into an annoying problem: my queued jobs just stay in the queue and never launch, even after their inputs are ready. I have looked through similar posts in the forum but didn’t find a fix. I am running on an HPC system.

Here is an example from the “Extensive Validation” workflow.

As you can see, J44 won’t launch even though J43 finished long ago. Running cryosparcm joblog P4 J44 returns the error “/home/zhuj6/cryoEM/Test/CS-t20s/J44/job.log: No such file or directory”. Here is the file listing of the J44 directory:

events.bson gridfs_data job.json

No submission script was created under J44. I didn’t have this problem when running version 4.3.x, and the problem remained after I reinstalled 4.4.1. I would really appreciate some help here. Thanks!
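A quick way to confirm a job never left the queued state is the cli call that also comes up later in this thread (the status field name is assumed to match the job document):

cryosparcm cli "get_job('P4', 'J44', 'status', 'queued_at', 'launched_at')"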

Here are the last lines from the command_core log.
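(For reference, this log can be followed live with:)

cryosparcm log command_core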

2023-12-26 09:49:41,508 dump_job_database    INFO     | Request to export P4 J43
2023-12-26 09:49:41,515 dump_job_database    INFO     |    Exporting job to /home/zhuj6/cryoEM/Test/CS-t20s/J43
2023-12-26 09:49:41,516 dump_job_database    INFO     |    Exporting all of job's images in the database to /home/zhuj6/cryoEM/Test/CS-t20s/J43/gridfs_data...
2023-12-26 09:49:41,517 dump_job_database    INFO     |    Done. Exported 0 images in 0.00s
2023-12-26 09:49:41,517 dump_job_database    INFO     |    Exporting all job's streamlog events...
2023-12-26 09:49:41,522 dump_job_database    INFO     |    Done. Exported 1 files in 0.00s
2023-12-26 09:49:41,522 dump_job_database    INFO     |    Exporting job metafile...
2023-12-26 09:49:41,527 dump_job_database    INFO     |    Done. Exported in 0.01s
2023-12-26 09:49:41,527 dump_job_database    INFO     |    Updating job manifest...
2023-12-26 09:49:41,532 dump_job_database    INFO     |    Done. Updated in 0.00s
2023-12-26 09:49:41,532 dump_job_database    INFO     | Exported P4 J43 in 0.02s
2023-12-26 09:49:41,533 run                  INFO     | Completed task in 0.024985074996948242 seconds
2023-12-26 09:49:52,085 run                  INFO     | Received task dump_job_database with 2 args and 0 kwargs
2023-12-26 09:49:52,086 dump_job_database    INFO     | Request to export P4 J44
2023-12-26 09:49:52,091 dump_job_database    INFO     |    Exporting job to /home/zhuj6/cryoEM/Test/CS-t20s/J44
2023-12-26 09:49:52,092 dump_job_database    INFO     |    Exporting all of job's images in the database to /home/zhuj6/cryoEM/Test/CS-t20s/J44/gridfs_data...
2023-12-26 09:49:52,093 dump_job_database    INFO     |    Done. Exported 0 images in 0.00s
2023-12-26 09:49:52,093 dump_job_database    INFO     |    Exporting all job's streamlog events...
2023-12-26 09:49:52,098 dump_job_database    INFO     |    Done. Exported 1 files in 0.00s
2023-12-26 09:49:52,098 dump_job_database    INFO     |    Exporting job metafile...
2023-12-26 09:49:52,105 dump_job_database    INFO     |    Done. Exported in 0.01s
2023-12-26 09:49:52,106 dump_job_database    INFO     |    Updating job manifest...
2023-12-26 09:49:52,111 dump_job_database    INFO     |    Done. Updated in 0.01s
2023-12-26 09:49:52,111 dump_job_database    INFO     | Exported P4 J44 in 0.03s
2023-12-26 09:49:52,112 run                  INFO     | Completed task in 0.026586294174194336 seconds
2023-12-26 09:50:07,360 scheduler_run_core   INFO     | Running...
2023-12-26 09:50:07,360 scheduler_run_core   INFO     | Jobs Queued: [('P4', 'J43')]
2023-12-26 09:50:07,363 scheduler_run_core   INFO     | Licenses currently active : 4
2023-12-26 09:50:07,363 scheduler_run_core   INFO     | Now trying to schedule J43
2023-12-26 09:50:07,364 scheduler_run_core   INFO     |    Scheduling directly onto master node cn1577
2023-12-26 09:50:07,364 scheduler_run_job_master_direct INFO     |    Scheduling directly onto master node cn1577
2023-12-26 09:50:08,344 scheduler_run_job_master_direct INFO     | Not a commercial instance - heartbeat set to 12 hours.
2023-12-26 09:50:08,431 set_job_status       INFO     | Status changed for P4.J43 from queued to launched
2023-12-26 09:50:08,432 app_stats_refresh    INFO     | Calling app stats refresh url http://cn1577:39022/api/actions/stats/refresh_job for project_uid P4, workspace_uid None, job_uid J43 with body {'projectUid': 'P4', 'jobUid': 'J43'}
2023-12-26 09:50:08,437 app_stats_refresh    INFO     | code 200, text {"success":true}
2023-12-26 09:50:08,471 run_job              INFO     |       Running P4 J43
2023-12-26 09:50:08,472 run_job              INFO     |         Running job on master node directly
2023-12-26 09:50:08,473 run_job              INFO     |         Running job using: /vf/users/zhuj6/apps/cryosparc/cryosparc_master/bin/cryosparcm
2023-12-26 09:50:08,491 scheduler_run_core   INFO     | Finished
2023-12-26 09:50:18,701 set_job_status       INFO     | Status changed for P4.J43 from launched to started
2023-12-26 09:50:18,702 app_stats_refresh    INFO     | Calling app stats refresh url http://cn1577:39022/api/actions/stats/refresh_job for project_uid P4, workspace_uid None, job_uid J43 with body {'projectUid': 'P4', 'jobUid': 'J43'}
2023-12-26 09:50:18,709 app_stats_refresh    INFO     | code 200, text {"success":true}
2023-12-26 09:50:27,537 set_job_status       INFO     | Status changed for P4.J43 from started to running
2023-12-26 09:50:27,539 app_stats_refresh    INFO     | Calling app stats refresh url http://cn1577:39022/api/actions/stats/refresh_job for project_uid P4, workspace_uid None, job_uid J43 with body {'projectUid': 'P4', 'jobUid': 'J43'}
2023-12-26 09:50:27,544 app_stats_refresh    INFO     | code 200, text {"success":true}
2023-12-26 09:50:27,740 dump_project         INFO     | Exporting project P4
2023-12-26 09:50:27,747 dump_project         INFO     | Exported project P4 to /home/zhuj6/cryoEM/Test/CS-t20s/project.json in 0.01s
2023-12-26 09:51:02,314 dump_job_database    INFO     | Request to export P4 J43
2023-12-26 09:51:02,316 dump_job_database    INFO     |    Exporting job to /home/zhuj6/cryoEM/Test/CS-t20s/J43
2023-12-26 09:51:02,317 dump_job_database    INFO     |    Exporting all of job's images in the database to /home/zhuj6/cryoEM/Test/CS-t20s/J43/gridfs_data...
2023-12-26 09:51:02,341 dump_job_database    INFO     |    Writing 7 database images to /home/zhuj6/cryoEM/Test/CS-t20s/J43/gridfs_data/gridfsdata_0
2023-12-26 09:51:02,341 dump_job_database    INFO     |    Done. Exported 7 images in 0.02s
2023-12-26 09:51:02,341 dump_job_database    INFO     |    Exporting all job's streamlog events...
2023-12-26 09:51:02,347 dump_job_database    INFO     |    Done. Exported 1 files in 0.01s
2023-12-26 09:51:02,347 dump_job_database    INFO     |    Exporting job metafile...
2023-12-26 09:51:02,348 dump_job_database    INFO     |    Creating .csg file for imported_movies
2023-12-26 09:51:02,362 dump_job_database    INFO     |    Done. Exported in 0.01s
2023-12-26 09:51:02,362 dump_job_database    INFO     |    Updating job manifest...
2023-12-26 09:51:02,367 dump_job_database    INFO     |    Done. Updated in 0.00s
2023-12-26 09:51:02,367 dump_job_database    INFO     | Exported P4 J43 in 0.05s
2023-12-26 09:51:02,384 set_job_status       INFO     | Status changed for P4.J43 from running to completed
2023-12-26 09:51:02,387 app_stats_refresh    INFO     | Calling app stats refresh url http://cn1577:39022/api/actions/stats/refresh_job for project_uid P4, workspace_uid None, job_uid J43 with body {'projectUid': 'P4', 'jobUid': 'J43'}
2023-12-26 09:51:02,392 app_stats_refresh    INFO     | code 200, text {"success":true}

Here is a new development: J44 had been queued for a few hours, doing nothing. Then, when I was working on another project and submitted a new job, J44 was submitted as well, at the same time. Here is the continuation of the log.

2023-12-26 09:51:02,367 dump_job_database    INFO     | Exported P4 J43 in 0.05s
2023-12-26 09:51:02,384 set_job_status       INFO     | Status changed for P4.J43 from running to completed
2023-12-26 09:51:02,387 app_stats_refresh    INFO     | Calling app stats refresh url http://cn1577:39022/api/actions/stats/refresh_job for project_uid P4, workspace_uid None, job_uid J43 with body {'projectUid': 'P4', 'jobUid': 'J43'}
2023-12-26 09:51:02,392 app_stats_refresh    INFO     | code 200, text {"success":true}
2023-12-26 12:14:46,761 run                  INFO     | Received task dump_job_database with 2 args and 0 kwargs
2023-12-26 12:14:46,761 dump_job_database    INFO     | Request to export P13 J9
2023-12-26 12:14:46,770 dump_job_database    INFO     |    Exporting job to /home/zhuj6/cryoEM/SRRP/CS-srrp/J9
2023-12-26 12:14:46,779 dump_job_database    INFO     |    Exporting all of job's images in the database to /home/zhuj6/cryoEM/SRRP/CS-srrp/J9/gridfs_data...
2023-12-26 12:14:46,788 dump_job_database    INFO     |    Done. Exported 0 images in 0.01s
2023-12-26 12:14:46,788 dump_job_database    INFO     |    Exporting all job's streamlog events...
2023-12-26 12:14:46,795 dump_job_database    INFO     |    Done. Exported 1 files in 0.01s
2023-12-26 12:14:46,795 dump_job_database    INFO     |    Exporting job metafile...
2023-12-26 12:14:46,807 dump_job_database    INFO     |    Done. Exported in 0.01s
2023-12-26 12:14:46,808 dump_job_database    INFO     |    Updating job manifest...
2023-12-26 12:14:46,819 dump_job_database    INFO     |    Done. Updated in 0.01s
2023-12-26 12:14:46,819 dump_job_database    INFO     | Exported P13 J9 in 0.06s
2023-12-26 12:14:46,820 run                  INFO     | Completed task in 0.059456825256347656 seconds
2023-12-26 12:14:57,341 param_set_spec_value INFO     | Setting parameter J9.diameter_max with value 150 of python type <class 'int'> and param type number
2023-12-26 12:14:57,344 run                  INFO     | Received task dump_job_database with 2 args and 0 kwargs
2023-12-26 12:14:57,345 dump_job_database    INFO     | Request to export P13 J9
2023-12-26 12:14:57,348 dump_job_database    INFO     |    Exporting job to /home/zhuj6/cryoEM/SRRP/CS-srrp/J9
2023-12-26 12:14:57,349 dump_job_database    INFO     |    Exporting all of job's images in the database to /home/zhuj6/cryoEM/SRRP/CS-srrp/J9/gridfs_data...
2023-12-26 12:14:57,351 dump_job_database    INFO     |    Done. Exported 0 images in 0.00s
2023-12-26 12:14:57,351 dump_job_database    INFO     |    Exporting all job's streamlog events...
2023-12-26 12:14:57,357 dump_job_database    INFO     |    Done. Exported 1 files in 0.01s
2023-12-26 12:14:57,357 dump_job_database    INFO     |    Exporting job metafile...
2023-12-26 12:14:57,363 dump_job_database    INFO     |    Done. Exported in 0.01s
2023-12-26 12:14:57,363 dump_job_database    INFO     |    Updating job manifest...
2023-12-26 12:14:57,368 dump_job_database    INFO     |    Done. Updated in 0.00s
2023-12-26 12:14:57,368 dump_job_database    INFO     | Exported P13 J9 in 0.02s
2023-12-26 12:14:57,369 run                  INFO     | Completed task in 0.024834632873535156 seconds
2023-12-26 12:15:04,296 scheduler_run_core   INFO     | Running...
2023-12-26 12:15:04,296 scheduler_run_core   INFO     | Jobs Queued: [('P4', 'J44'), ('P13', 'J9')]
2023-12-26 12:15:04,299 scheduler_run_core   INFO     | Licenses currently active : 4
2023-12-26 12:15:04,299 scheduler_run_core   INFO     | Now trying to schedule J44
2023-12-26 12:15:04,299 scheduler_run_job    INFO     |    Scheduling job to slurm
2023-12-26 12:15:05,261 scheduler_run_job    INFO     | Not a commercial instance - heartbeat set to 12 hours.
2023-12-26 12:15:05,346 scheduler_run_job    INFO     |      Launchable! -- Launching.
2023-12-26 12:15:05,351 set_job_status       INFO     | Status changed for P4.J44 from queued to launched
2023-12-26 12:15:05,352 app_stats_refresh    INFO     | Calling app stats refresh url http://cn1577:39022/api/actions/stats/refresh_job for project_uid P4, workspace_uid None, job_uid J44 with body {'projectUid': 'P4', 'jobUid': 'J44'}
2023-12-26 12:15:05,359 app_stats_refresh    INFO     | code 200, text {"success":true}
2023-12-26 12:15:05,399 run_job              INFO     |       Running P4 J44
2023-12-26 12:15:05,415 run_job              INFO     |         cmd: /usr/local/slurm/bin/sbatch /home/zhuj6/cryoEM/Test/CS-t20s/J44/queue_sub_script.sh
2023-12-26 12:15:06,683 scheduler_run_core   INFO     | Licenses currently active : 5
2023-12-26 12:15:06,684 scheduler_run_core   INFO     | Now trying to schedule J9
2023-12-26 12:15:06,684 scheduler_run_job    INFO     |    Scheduling job to slurm
2023-12-26 12:15:07,663 scheduler_run_job    INFO     | Not a commercial instance - heartbeat set to 12 hours.
2023-12-26 12:15:07,747 scheduler_run_job    INFO     |      Launchable! -- Launching.
2023-12-26 12:15:07,752 set_job_status       INFO     | Status changed for P13.J9 from queued to launched
2023-12-26 12:15:07,753 app_stats_refresh    INFO     | Calling app stats refresh url http://cn1577:39022/api/actions/stats/refresh_job for project_uid P13, workspace_uid None, job_uid J9 with body {'projectUid': 'P13', 'jobUid': 'J9'}
2023-12-26 12:15:07,762 app_stats_refresh    INFO     | code 200, text {"success":true}
2023-12-26 12:15:07,796 run_job              INFO     |       Running P13 J9
2023-12-26 12:15:07,811 run_job              INFO     |         cmd: /usr/local/slurm/bin/sbatch /home/zhuj6/cryoEM/SRRP/CS-srrp/J9/queue_sub_script.sh
2023-12-26 12:15:09,229 scheduler_run_core   INFO     | Finished
2023-12-26 12:15:16,910 run                  INFO     | Received task dump_job_database with 2 args and 0 kwargs
2023-12-26 12:15:16,910 dump_job_database    INFO     | Request to export P13 J10
2023-12-26 12:15:16,916 dump_job_database    INFO     |    Exporting job to /home/zhuj6/cryoEM/SRRP/CS-srrp/J10
2023-12-26 12:15:16,917 dump_job_database    INFO     |    Exporting all of job's images in the database to /home/zhuj6/cryoEM/SRRP/CS-srrp/J10/gridfs_data...
2023-12-26 12:15:16,918 dump_job_database    INFO     |    Done. Exported 0 images in 0.00s
2023-12-26 12:15:16,918 dump_job_database    INFO     |    Exporting all job's streamlog events...
2023-12-26 12:15:16,922 dump_job_database    INFO     |    Done. Exported 1 files in 0.00s
2023-12-26 12:15:16,923 dump_job_database    INFO     |    Exporting job metafile...
2023-12-26 12:15:16,928 dump_job_database    INFO     |    Done. Exported in 0.01s
2023-12-26 12:15:16,928 dump_job_database    INFO     |    Updating job manifest...
2023-12-26 12:15:16,933 dump_job_database    INFO     |    Done. Updated in 0.00s
2023-12-26 12:15:16,933 dump_job_database    INFO     | Exported P13 J10 in 0.02s
2023-12-26 12:15:16,934 run                  INFO     | Completed task in 0.023966312408447266 seconds
2023-12-26 12:15:51,296 set_job_status       INFO     | Status changed for P4.J44 from launched to started
2023-12-26 12:15:51,297 app_stats_refresh    INFO     | Calling app stats refresh url http://cn1577:39022/api/actions/stats/refresh_job for project_uid P4, workspace_uid None, job_uid J44 with body {'projectUid': 'P4', 'jobUid': 'J44'}
2023-12-26 12:15:51,303 app_stats_refresh    INFO     | code 200, text {"success":true}
2023-12-26 12:15:54,679 set_job_status       INFO     | Status changed for P4.J44 from started to running
2023-12-26 12:15:54,680 app_stats_refresh    INFO     | Calling app stats refresh url http://cn1577:39022/api/actions/stats/refresh_job for project_uid P4, workspace_uid None, job_uid J44 with body {'projectUid': 'P4', 'jobUid': 'J44'}
2023-12-26 12:15:54,685 app_stats_refresh    INFO     | code 200, text {"success":true}
2023-12-26 12:15:57,647 set_job_status       INFO     | Status changed for P13.J9 from launched to started
2023-12-26 12:15:57,648 app_stats_refresh    INFO     | Calling app stats refresh url http://cn1577:39022/api/actions/stats/refresh_job for project_uid P13, workspace_uid None, job_uid J9 with body {'projectUid': 'P13', 'jobUid': 'J9'}
2023-12-26 12:15:57,655 app_stats_refresh    INFO     | code 200, text {"success":true}
2023-12-26 12:16:06,167 set_job_status       INFO     | Status changed for P13.J9 from started to running
2023-12-26 12:16:06,168 app_stats_refresh    INFO     | Calling app stats refresh url http://cn1577:39022/api/actions/stats/refresh_job for project_uid P13, workspace_uid None, job_uid J9 with body {'projectUid': 'P13', 'jobUid': 'J9'}
2023-12-26 12:16:06,175 app_stats_refresh    INFO     | code 200, text {"success":true}
2023-12-26 12:22:00,671 dump_job_database    INFO     | Request to export P4 J44
2023-12-26 12:22:00,675 dump_job_database    INFO     |    Exporting job to /home/zhuj6/cryoEM/Test/CS-t20s/J44
2023-12-26 12:22:00,676 dump_job_database    INFO     |    Exporting all of job's images in the database to /home/zhuj6/cryoEM/Test/CS-t20s/J44/gridfs_data...
2023-12-26 12:22:00,772 dump_job_database    INFO     |    Writing 53 database images to /home/zhuj6/cryoEM/Test/CS-t20s/J44/gridfs_data/gridfsdata_0
2023-12-26 12:22:00,772 dump_job_database    INFO     |    Done. Exported 53 images in 0.10s
2023-12-26 12:22:00,772 dump_job_database    INFO     |    Exporting all job's streamlog events...
2023-12-26 12:22:00,780 dump_job_database    INFO     |    Done. Exported 1 files in 0.01s
2023-12-26 12:22:00,780 dump_job_database    INFO     |    Exporting job metafile...
2023-12-26 12:22:00,782 dump_job_database    INFO     |    Creating .csg file for micrographs
2023-12-26 12:22:00,800 dump_job_database    INFO     |    Done. Exported in 0.02s
2023-12-26 12:22:00,801 dump_job_database    INFO     |    Updating job manifest...
2023-12-26 12:22:00,806 dump_job_database    INFO     |    Done. Updated in 0.01s
2023-12-26 12:22:00,807 dump_job_database    INFO     | Exported P4 J44 in 0.14s
2023-12-26 12:22:00,824 set_job_status       INFO     | Status changed for P4.J44 from running to completed
2023-12-26 12:22:00,828 app_stats_refresh    INFO     | Calling app stats refresh url http://cn1577:39022/api/actions/stats/refresh_job for project_uid P4, workspace_uid None, job_uid J44 with body {'projectUid': 'P4', 'jobUid': 'J44'}
2023-12-26 12:22:00,833 app_stats_refresh    INFO     | code 200, text {"success":true}

Is this an uninterrupted excerpt from command_core.log, i.e., there were no entries for more than two hours?

That is right, nothing happened in between. I have been struggling with this ever since the version update: every queued job must be cleared and re-queued after the previous job finishes.
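(A possible stopgap while this is diagnosed, assuming scheduler_run() is exposed through the cli on this version, would be to kick the scheduler by hand instead of clearing and re-queuing:)

cryosparcm cli "scheduler_run()"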

@jhzhu How much RAM does the CryoSPARC master server have, and what other workloads does it handle? What is the output of this command?

free -g

Everything looks fine to me on the server.

free -g
              total        used        free      shared  buff/cache   available
Mem:            251          40          41           0         170         209
Swap:             1           0           1
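(For the arithmetic: 40 GB of 251 GB are in use, about 209 GB remain available, and swap is untouched, so memory pressure on the master node indeed looks unlikely.)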

@jhzhu What are the outputs of these commands?

ps -weo pid,ppid,start,rsz,vsz,rsz,%mem,cmd | grep cryosparc
uname -a
cat /sys/kernel/mm/transparent_hugepage/enabled

ps -weo pid,ppid,start,rsz,vsz,rsz,%mem,cmd | grep cryosparc
1473619       1   Dec 28 26516  56116 26516  0.0 python /vf/users/huangrk/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/supervisord -c /vf/users/huangrk/apps/cryosparc/cryosparc_master/supervisord.conf
1473742 1473619   Dec 28 4255688 5410324 4255688  1.6 mongod --auth --dbpath /vf/users/huangrk/apps/cryosparc/cryosparc_database --port 39023 --oplogSize 64 --replSet meteor --wiredTigerCacheSizeGB 4 --bind_ip_all
1473859 1473619   Dec 28 222536 1043952 222536  0.0 python -c import cryosparc_command.command_core as serv; serv.start(port=39024)
1473886 1473619   Dec 28 266252 1475460 266252  0.1 python -c import cryosparc_command.command_vis as serv; serv.start(port=39025)
1473890 1473619   Dec 28 201420 837692 201420  0.0 python -c import cryosparc_command.command_rtp as serv; serv.start(port=39027)
1474029 1473619   Dec 28 155292 1084868 155292  0.0 /vf/users/huangrk/apps/cryosparc/cryosparc_master/cryosparc_app/nodejs/bin/node ./bundle/main.js
1474143       1   Dec 28  3180  12868  3180  0.0 /bin/bash /usr/local/apps/cryosparc/libexec/auto_archive.sh
1758311       1   Dec 31 26456  56116 26456  0.0 python /vf/users/zhuj6/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/supervisord -c /vf/users/zhuj6/apps/cryosparc/cryosparc_master/supervisord.conf
1758440 1758311   Dec 31 4855528 6437444 4855528  1.8 mongod --auth --dbpath /vf/users/zhuj6/apps/cryosparc/cryosparc_database --port 39052 --oplogSize 64 --replSet meteor --wiredTigerCacheSizeGB 4 --bind_ip_all
1758558 1758311   Dec 31 4927764 5610456 4927764  1.8 python -c import cryosparc_command.command_core as serv; serv.start(port=39053)
1758588 1758311   Dec 31 302420 1531528 302420  0.1 python -c import cryosparc_command.command_vis as serv; serv.start(port=39054)
1758592 1758311   Dec 31 214740 837624 214740  0.0 python -c import cryosparc_command.command_rtp as serv; serv.start(port=39056)
1758663 1758311   Dec 31 285560 1130608 285560  0.1 /vf/users/zhuj6/apps/cryosparc/cryosparc_master/cryosparc_app/nodejs/bin/node ./bundle/main.js
1758713       1   Dec 31  3184  12872  3184  0.0 /bin/bash /usr/local/apps/cryosparc/libexec/auto_archive.sh
1887778 1473859 12:30:39 226548 564252 226548  0.0 python -c import cryosparc_compute.run as run; run.run() --project P12 --job J72 --master_hostname cn1555 --master_command_core_port 39024
1887814 1887778 12:30:41 2551620 3771224 2551620  0.9 python -c import cryosparc_compute.run as run; run.run() --project P12 --job J72 --master_hostname cn1555 --master_command_core_port 39024
1956057 1953075 11:23:01  1164  12148  1164  0.0 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=.idea --exclude-dir=.tox cryosparc
uname -a
Linux cn1555 4.18.0-425.19.2.el8_7.x86_64 #1 SMP Tue Apr 4 22:38:11 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
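(Aside: [always] means transparent huge pages are enabled system-wide, which MongoDB, the database behind CryoSPARC, recommends against. Whether this relates to the scheduling problem is not established, but if the admins wanted to switch it until the next reboot, something like this, run as root, would do it:)

echo madvise > /sys/kernel/mm/transparent_hugepage/enabled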

@jhzhu You may want to try

  1. completely shutting down the relevant CryoSPARC instance. According to the ps output, there seems to be an additional CryoSPARC instance running on this computer, so if you need to terminate processes manually, be sure to kill only those belonging to the specific instance in question (see the sketch after this list).
  2. then starting the CryoSPARC instance again.
  3. then queuing new jobs.
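A minimal sketch of steps 1–2, assuming the install path /vf/users/zhuj6/apps/cryosparc is unique to this instance:

cryosparcm stop
ps -weo pid,ppid,cmd | grep zhuj6/apps/cryosparc    # confirm nothing from this instance is still running
cryosparcm start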

If you continue to observe job launch failures, please email us compressed copies of the relevant CryoSPARC instance’s cryosparc_master/run/command_core.log* files, including entries from the complete CryoSPARC shutdown/restart mentioned above onward. I will let you know the email address via private message.

I shut down the server and restarted it on a different HPC node; the problem is still the same. I have sent the command_core.log by email.

Thanks for sending the logs.

May I ask:
Do the CryoSPARC master services run as a slurm job?
Do you update the CRYOSPARC_MASTER_HOSTNAME inside cryosparc_master/config.sh every time you start CryoSPARC on a different node?
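(For reference, the relevant line in cryosparc_master/config.sh looks like the following; the hostname value here is illustrative:)

export CRYOSPARC_MASTER_HOSTNAME="cn1577"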

What are the affected (“waiting”) CryoSPARC jobs’ project and job UIDs?

Do the CryoSPARC master services run as a slurm job?

Yes.

Do you update the CRYOSPARC_MASTER_HOSTNAME inside cryosparc_master/config.sh every time you start CryoSPARC on a different node?

I don’t think so. I will check with my admin.
Edit: My admin told me the hostname and port are updated every time the server is started.
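(For context, such a per-start update is typically a small edit made by the script that launches the master, before cryosparcm start. A hypothetical sketch, not necessarily the admin’s actual mechanism:)

sed -i "s/^export CRYOSPARC_MASTER_HOSTNAME=.*/export CRYOSPARC_MASTER_HOSTNAME=\"$(hostname)\"/" /vf/users/zhuj6/apps/cryosparc/cryosparc_master/config.sh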

What are the affected (“waiting”) CryoSPARC jobs’ project and job UIDs?

P16, J70.

Interestingly, we found no log record of P16 J70 having been queued. Please can you post the output of the command

cryosparcm cli "get_job('P16', 'J70', 'queued_at', 'launched_at', 'started_at')"
cryosparcm cli "get_job('P16', 'J70', 'queued_at', 'launched_at', 'started_at')"
{'_id': '659d95ca81a656dde2f4e484', 'launched_at': None, 'project_uid': 'P16', 'queued_at': 'Tue, 09 Jan 2024 18:52:33 GMT', 'started_at': None, 'uid': 'J70'}

Thanks @jhzhu. Please can you also post the output of this command:

cryosparcm cli "get_job('P16', 'J70', 'created_at')"
date

The latter command’s output should help me to match the GMT time from the command output to the log timestamps.

cryosparcm cli "get_job('P16', 'J70', 'created_at')"
{'_id': '659d95ca81a656dde2f4e484', 'created_at': 'Tue, 09 Jan 2024 18:51:54 GMT', 'project_uid': 'P16', 'uid': 'J70'}
date
Fri Jan 12 16:43:55 EST 2024
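(Matching the timestamps: EST is UTC−5, so created_at Tue, 09 Jan 2024 18:51:54 GMT corresponds to 13:51:54 EST in the logs, and the queued_at value of 18:52:33 GMT to 13:52:33 EST.)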

@jhzhu I sent you a direct message regarding a software patch that enables additional logging.

I can’t install the patch.

cryosparcm patch --install
/vf/users/zhuj6/apps/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/command.py:134: UserWarning: *** CommandClient: (http://cn1641:39024/api) URL Error [Errno 111] Connection refused, attempt 1 of 3. Retrying in 30 seconds
  system = self._get_callable("system.describe")()
/vf/users/zhuj6/apps/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/command.py:134: UserWarning: *** CommandClient: (http://cn1641:39024/api) URL Error [Errno 111] Connection refused, attempt 2 of 3. Retrying in 30 seconds
  system = self._get_callable("system.describe")()
Traceback (most recent call last):
  File "/vf/users/zhuj6/apps/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/command.py", line 104, in func
    with make_json_request(self, "/api", data=data, _stacklevel=4) as request:
  File "/vf/users/zhuj6/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/vf/users/zhuj6/apps/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/command.py", line 225, in make_request
    raise CommandError(error_reason, url=url, code=code, data=resdata)
cryosparc_tools.cryosparc.errors.CommandError: *** (http://cn1641:39024/api, code 500) URL Error [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vf/users/zhuj6/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/vf/users/zhuj6/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/vf/users/zhuj6/apps/cryosparc/cryosparc_master/patch.py", line 34, in <module>
    CLI = client.CommandClient(
  File "/vf/users/zhuj6/apps/cryosparc/cryosparc_master/cryosparc_compute/client.py", line 38, in __init__
    super().__init__(service, host, port, url, timeout, headers, cls=NumpyEncoder)
  File "/vf/users/zhuj6/apps/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/command.py", line 96, in __init__
    self._reload()  # attempt connection immediately to gather methods
  File "/vf/users/zhuj6/apps/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/command.py", line 134, in _reload
    system = self._get_callable("system.describe")()
  File "/vf/users/zhuj6/apps/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/command.py", line 107, in func
    raise CommandError(
cryosparc_tools.cryosparc.errors.CommandError: *** (http://cn1641:39024, code 500) Encounted error from JSONRPC function "system.describe" with params ()

Please can you try again with CryoSPARC running?

cd /vf/users/zhuj6/apps/cryosparc/cryosparc_master/
./bin/cryosparcm start
./bin/cryosparcm patch --install

With CryoSPARC running:

cryosparcm patch --install
Patching master...
Done.
Skipped worker patch.
[1]    1429010 segmentation fault (core dumped)  cryosparcm patch --install
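(One way to sanity-check that the master patch took despite the segmentation fault, assuming patched builds report their version with a patch suffix such as v4.4.1+&lt;patch id&gt;:)

cryosparcm status | grep -i version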

May I suggest

  • assuming, for now, that this segmentation fault does not indicate a patching failure and that the patch was applied correctly to cryosparc_master/ (this assumption may have to be reviewed/revised later :neutral_face:)
  • proceeding to the other steps in the privately shared instructions:
    • modifying cryosparc_master/config.sh to enable debug logging
    • cryosparcm restart
    • patching cryosparc_worker/
    • sending us logs once you observe the problem again
    • disabling debug logging