Hi @witempel
Thank you for the response. Our IT department managed to fix CryoSPARC by clearing the DB cache, and the proposed long-term solution is to move the database to a high-performance file system. Could you please tell me whether there is any limit on the size of the MongoDB database? We currently have ~300 projects and add ~2 datasets per week, so I wonder whether we might hit a wall in the future.
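For our own rough capacity planning we have been using a simple linear-growth projection (a back-of-the-envelope sketch only — all numbers below are hypothetical, and this says nothing about any actual MongoDB limit):

```python
def weeks_until_limit(current_gb: float, gb_per_dataset: float,
                      datasets_per_week: float, limit_gb: float) -> float:
    """Weeks until the database reaches limit_gb, assuming linear growth."""
    growth_per_week = gb_per_dataset * datasets_per_week
    if growth_per_week <= 0:
        return float("inf")  # no growth: the limit is never reached
    return max(0.0, (limit_gb - current_gb) / growth_per_week)

# Hypothetical numbers: 40 GB today, ~0.1 GB of job metadata per dataset,
# 2 new datasets per week, and a 100 GB storage budget:
print(weeks_until_limit(40, 0.1, 2, 100))  # 300.0
```

So even under pessimistic assumptions the growth looks slow, but we would like to know whether MongoDB itself (or CryoSPARC's use of it) imposes a hard ceiling.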
The output is below.
Sincerely,
Sergei
[pletnevs@ai-hpccryoprd3 bin]$ csprojectid=P1
csjobid=J641
ps -eo user,pid,ppid,start,rsz,vsz,command | grep -e cryosparc_ -e mongo
eval $(cryosparcm env) # load cryosparc environment
./cryosparcm eventlog $csprojectid $csjobid | tail -n 40
./cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run', 'started_at')"
./cryosparcm cli "get_scheduler_targets()"
curl ai-hpccryoprd3.niaid.nih.gov:39002
./cryosparcm log supervisord | tail -n 40
echo $CRYOSPARC_DB_PATH
du -sh $CRYOSPARC_DB_PATH
grep "$(df $CRYOSPARC_DB_PATH | tail -n 1 | awk '{print $NF}') " /proc/mounts
exit # exit the shell after recording outputs
svc_hpc+ 44879 1 Oct 30 25368 41288 python /data/home/svc_hpccryoprd3/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/supervisord -c /data/home/svc_hpccryoprd3/cryosparc_master/supervisord.conf
svc_hpc+ 45192 44879 Oct 30 4149576 5876640 mongod --auth --dbpath /var/lib/CryoSPARCv3 --port 39001 --oplogSize 64 --replSet meteor --wiredTigerCacheSizeGB 4 --bind_ip_all
svc_hpc+ 45303 44879 Oct 30 740132 1752764 python -c import cryosparc_command.command_core as serv; serv.start(port=39002)
svc_hpc+ 45346 44879 Oct 30 304536 1454660 python -c import cryosparc_command.command_vis as serv; serv.start(port=39003)
svc_hpc+ 45350 44879 Oct 30 236816 939720 python -c import cryosparc_command.command_rtp as serv; serv.start(port=39005)
svc_hpc+ 45814 44879 Oct 30 524240 1479708 /data/home/svc_hpccryoprd3/cryosparc_master/cryosparc_app/nodejs/bin/node ./bundle/main.js
svc_hpc+ 341440 45303 10:03:40 227440 567212 python -c import cryosparc_compute.run as run; run.run() --project P267 --job J106 --master_hostname ai-hpccryoprd3.niaid.nih.gov --master_command_core_port 39002
svc_hpc+ 341479 341440 10:03:47 417772 1705928 python -c import cryosparc_compute.run as run; run.run() --project P267 --job J106 --master_hostname ai-hpccryoprd3.niaid.nih.gov --master_command_core_port 39002
svc_hpc+ 500730 45303 17:48:25 224428 567216 python -c import cryosparc_compute.run as run; run.run() --project P267 --job J120 --master_hostname ai-hpccryoprd3.niaid.nih.gov --master_command_core_port 39002
svc_hpc+ 500757 500730 17:48:28 1149020 2440360 python -c import cryosparc_compute.run as run; run.run() --project P267 --job J120 --master_hostname ai-hpccryoprd3.niaid.nih.gov --master_command_core_port 39002
pletnevs 502984 502272 17:57:23 2192 6272 grep --color=auto -e cryosparc_ -e mongo
-bash: cryosparcm: command not found
Viewing Direction Distribution Iteration 006
Posterior Precision Directional Distribution Iteration 006
[CPU RAM used: 46726 MB] Done in 45.903s.
[CPU RAM used: 46726 MB] Outputting files..
[CPU RAM used: 45976 MB] Done in 27.214s.
[CPU RAM used: 45976 MB] Done iteration 6 in 420.968s. Total time so far 2536.106s
[CPU RAM used: 45976 MB] ====== Done Refinement ======
[CPU RAM used: 45976 MB] Note that the output structure from refinement has been
lowpass filtered to the gold-standard FSC resolution estimated
above. However, the structure probably needs sharpening in
order to best visualize high resolution features. This can be
done with the Sharpening task (as a new experiment).
[CPU RAM used: 45976 MB] Full run took 2536.697s
[CPU RAM used: 13203 MB] --------------------------------------------------------------
[CPU RAM used: 13203 MB] Compiling job outputs...
[CPU RAM used: 13204 MB] Passing through outputs for output group particles from input group particles
[CPU RAM used: 13206 MB] This job outputted results ['alignments3D', 'ctf']
[CPU RAM used: 13206 MB] Loaded output dset with 6828 items
[CPU RAM used: 13206 MB] Passthrough results ['blob', 'alignments2D', 'pick_stats', 'location']
[CPU RAM used: 13207 MB] Loaded passthrough dset with 6828 items
[CPU RAM used: 13207 MB] Intersection of output and passthrough has 6828 items
[CPU RAM used: 13207 MB] Checking outputs for output group particles
[CPU RAM used: 13210 MB] Updating job size...
[CPU RAM used: 13209 MB] Exporting job and creating csg files...
[CPU RAM used: 13209 MB] Traceback (most recent call last):
  File "/data/home/svc_hpccryoprd3/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 104, in func
    with make_json_request(self, "/api", data=data, _stacklevel=4) as request:
  File "/data/home/svc_hpccryoprd3/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/data/home/svc_hpccryoprd3/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 225, in make_request
    raise CommandError(error_reason, url=url, code=code, data=resdata)
cryosparc_tools.cryosparc.errors.CommandError: *** (http://ai-hpccryoprd3.niaid.nih.gov:39002/api, code 500) Timeout Error
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 105, in cryosparc_master.cryosparc_compute.run.main
  File "/data/home/svc_hpccryoprd3/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 107, in func
    raise CommandError(
cryosparc_tools.cryosparc.errors.CommandError: *** (http://ai-hpccryoprd3.niaid.nih.gov:39002, code 500) Encounted error from JSONRPC function "dump_job_database" with params {'project_uid': 'P1', 'job_uid': 'J641', 'job_completed': True}
{'_id': '671ff8f69657052c27f1a948', 'errors_run': [{'message': '*** (http://ai-hpccryoprd3.niaid.nih.gov:39002, code 500) Encounted error from JSONRPC function "dump_job_database" with params {\'project_uid\': \'P1\', \'job_uid\': \'J641\', \'job_completed\': True}', 'warning': False}], 'instance_information': {'CUDA_version': '11.8', 'available_memory': '1.40TB', 'cpu_model': 'Intel(R) Xeon(R) Gold 5320 CPU @ 2.20GHz', 'driver_version': '12.3', 'gpu_info': [{'id': 0, 'mem': 50789154816, 'name': 'NVIDIA L40'}], 'ofd_hard_limit': 131072, 'ofd_soft_limit': 1024, 'physical_cores': 52, 'platform_architecture': 'x86_64', 'platform_node': 'ai-hpcgpu32.niaid.nih.gov', 'platform_release': '5.14.0-284.30.1.el9_2.x86_64', 'platform_version': '#1 SMP PREEMPT_DYNAMIC Sat Sep 16 09:55:41 UTC 2023', 'total_memory': '1.48TB', 'used_memory': '72.50GB'}, 'job_type': 'homo_refine_new', 'params_spec': {'refine_symmetry': {'value': 'I'}}, 'project_uid': 'P1', 'started_at': 'Mon, 28 Oct 2024 20:50:56 GMT', 'status': 'failed', 'uid': 'J641', 'version': 'v4.4.1+240110'}
[{'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000000, 'custom_var_names': ['ram_multiplier', 'cpu_multiplier'], 'custom_vars': {}, 'desc': None, 'hostname': 'skyline-gpu', 'lane': 'skyline-gpu', 'name': 'skyline-gpu', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## Available custom variables\n## {{ ram_multiplier }} Multiple the job types ram_gb by this\n## {{ cpu_multiplier }} Multiple the job types num_cpu by this\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name "cryosparc_{{ project_uid }}_{{ job_uid }}_{{ job_creator }}"\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --partition=gpu\n#SBATCH --gres=gpu:{{ [num_gpu, 4] | min }}\n#SBATCH --cpus-per-task={{ [(num_cpu|float * cpu_multiplier|default(1)|float)|int,96] | min }}\n#SBATCH --mem={{ (ram_gb|float * ram_multiplier|default(1)|float)|int }}G\n#SBATCH --output=slurm_logs/%x-{{ cryosparc_username }}-%N-%j-stdout.log\n#SBATCH --error=slurm_logs/%x-{{ cryosparc_username }}-%N-%j-stderr.log\n## cause jobs to launch even in the face of maintenance\n#SBATCH --time-min=1-0\nscontrol show job --json -d ${SLURM_JOB_ID} |jq \'.jobs[].gres_detail\'\n\nsrun {{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'skyline-gpu', 'tpl_vars': ['num_gpu', 'project_uid', 'job_creator', 'project_dir_abs', 'job_uid', 'cryosparc_username', 'ram_multiplier', 'ram_gb', 'job_dir_abs', 'worker_bin_path', 'cluster_job_id', 'run_args', 'job_log_path_abs', 'cpu_multiplier', 'run_cmd', 'num_cpu', 'command'], 'type': 'cluster', 'worker_bin_path': '/data/home/svc_hpccryoprd3/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000000, 'custom_var_names': ['ram_multiplier', 'cpu_multiplier'], 'custom_vars': {}, 'desc': None, 'hostname': 'skyline-gpu_l40', 'lane': 'skyline-gpu_l40', 'name': 'skyline-gpu_l40', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## Available custom variables\n## {{ ram_multiplier }} Multiple the job types ram_gb by this\n## {{ cpu_multiplier }} Multiple the job types num_cpu by this\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name "cryosparc_{{ project_uid }}_{{ job_uid }}_{{ job_creator }}"\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --partition=gpu_l40\n#SBATCH --gres=gpu:{{ [num_gpu, 8] | min }}\n#SBATCH --cpus-per-task={{ [(num_cpu|float * cpu_multiplier|default(1)|float)|int,52] | min }}\n#SBATCH --mem={{ (ram_gb|float * ram_multiplier|default(1)|float)|int }}G\n#SBATCH --output=slurm_logs/%x-{{ cryosparc_username }}-%N-%j-stdout.log\n#SBATCH --error=slurm_logs/%x-{{ cryosparc_username }}-%N-%j-stderr.log\n## cause jobs to launch even in the face of maintenance\n#SBATCH --time-min=1-0\nscontrol show job --json -d ${SLURM_JOB_ID} |jq \'.jobs[].gres_detail\'\n\nsrun {{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'skyline-gpu_l40', 'tpl_vars': ['num_gpu', 'project_uid', 'job_creator', 'project_dir_abs', 'job_uid', 'cryosparc_username', 'ram_multiplier', 'ram_gb', 'job_dir_abs', 'worker_bin_path', 'cluster_job_id', 'run_args', 'job_log_path_abs', 'cpu_multiplier', 'run_cmd', 'num_cpu', 'command'], 'type': 'cluster', 'worker_bin_path': '/data/home/svc_hpccryoprd3/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000000, 'custom_var_names': ['ram_multiplier', 'cpu_multiplier'], 'custom_vars': {}, 'desc': None, 'hostname': 'skyline-all,himem', 'lane': 'skyline-all,himem', 'name': 'skyline-all,himem', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## Available custom variables\n## {{ ram_multiplier }} Multiple the job types ram_gb by this\n## {{ cpu_multiplier }} Multiple the job types num_cpu by this\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name "cryosparc_{{ project_uid }}_{{ job_uid }}_{{ job_creator }}"\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --partition=all,himem\n#SBATCH --cpus-per-task={{ [(num_cpu|float * cpu_multiplier|default(1)|float)|int,64] | min }}\n#SBATCH --mem={{ (ram_gb|float * ram_multiplier|default(1)|float)|int }}G\n#SBATCH --output=slurm_logs/%x-{{ cryosparc_username }}-%N-%j-stdout.log\n#SBATCH --error=slurm_logs/%x-{{ cryosparc_username }}-%N-%j-stderr.log\n## cause jobs to launch even in the face of maintenance\n#SBATCH --time-min=1-0\nscontrol show job --json -d ${SLURM_JOB_ID} |jq \'.jobs[].gres_detail\'\n\nsrun {{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'skyline-all,himem', 'tpl_vars': ['num_gpu', 'project_uid', 'job_creator', 'project_dir_abs', 'job_uid', 'cryosparc_username', 'ram_multiplier', 'ram_gb', 'job_dir_abs', 'worker_bin_path', 'cluster_job_id', 'run_args', 'job_log_path_abs', 'cpu_multiplier', 'run_cmd', 'num_cpu', 'command'], 'type': 'cluster', 'worker_bin_path': '/data/home/svc_hpccryoprd3/cryosparc_worker/bin/cryosparcw'}]
Hello World from cryosparc command core.
2024-10-30 11:00:45,340 INFO spawned: 'database' with pid 8861
2024-10-30 11:00:47,316 INFO success: database entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-10-30 11:00:55,256 INFO spawned: 'command_core' with pid 8974
2024-10-30 11:01:00,263 INFO success: command_core entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2024-10-30 11:01:58,314 INFO spawned: 'command_vis' with pid 9249
2024-10-30 11:01:59,318 INFO success: command_vis entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-10-30 11:02:01,111 INFO spawned: 'command_rtp' with pid 9253
2024-10-30 11:02:02,112 INFO success: command_rtp entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-10-30 11:03:18,285 INFO spawned: 'app' with pid 11451
2024-10-30 11:03:19,288 INFO success: app entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-10-30 11:03:22,337 INFO spawned: 'app_api' with pid 11469
2024-10-30 11:03:23,368 INFO success: app_api entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-10-30 11:56:52,240 INFO waiting for app to stop
2024-10-30 11:56:52,241 INFO waiting for app_api to stop
2024-10-30 11:56:52,242 INFO waiting for command_core to stop
2024-10-30 11:56:52,242 INFO waiting for command_rtp to stop
2024-10-30 11:56:52,242 INFO waiting for command_vis to stop
2024-10-30 11:56:52,242 INFO waiting for database to stop
2024-10-30 11:56:52,319 WARN stopped: app (terminated by SIGTERM)
2024-10-30 11:56:52,319 WARN stopped: app_api (terminated by SIGTERM)
2024-10-30 11:56:52,332 WARN stopped: command_rtp (terminated by SIGQUIT (core dumped))
2024-10-30 11:56:52,653 WARN stopped: command_core (terminated by SIGQUIT (core dumped))
2024-10-30 11:56:52,821 WARN stopped: command_vis (terminated by SIGQUIT (core dumped))
2024-10-30 11:56:53,836 INFO stopped: database (exit status 0)
2024-10-30 11:57:42,845 INFO RPC interface 'supervisor' initialized
2024-10-30 11:57:42,845 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-10-30 11:57:42,846 INFO daemonizing the supervisord process
2024-10-30 11:57:42,861 INFO supervisord started with pid 44879
2024-10-30 11:57:58,582 INFO spawned: 'database' with pid 45192
2024-10-30 11:58:00,476 INFO success: database entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-10-30 11:58:06,818 INFO spawned: 'command_core' with pid 45303
2024-10-30 11:58:11,825 INFO success: command_core entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2024-10-30 11:58:32,037 INFO spawned: 'command_vis' with pid 45346
2024-10-30 11:58:33,038 INFO success: command_vis entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-10-30 11:58:34,412 INFO spawned: 'command_rtp' with pid 45350
2024-10-30 11:58:35,414 INFO success: command_rtp entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-10-30 11:58:55,536 INFO spawned: 'app' with pid 45792
2024-10-30 11:58:56,537 INFO success: app entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-10-30 11:58:58,225 INFO spawned: 'app_api' with pid 45814
2024-10-30 11:58:59,227 INFO success: app_api entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
83K .
tmpfs /run/user/1109906 tmpfs rw,seclabel,nosuid,nodev,relatime,size=6583792k,nr_inodes=1645948,mode=700,uid=1109906,gid=1109906,inode64 0 0
logout
Connection to ai-hpccryoprd3.niaid.nih.gov closed.