Hi @witempel
Thank you for the response. Our IT department managed to fix CryoSPARC by clearing the DB cache, and the proposed long-term solution is to move the database to a high-performance file system. Could you please tell me whether there is any limit on the size of the MongoDB database? We currently have ~300 projects and add ~2 datasets per week, so I wonder whether we might hit a wall in the future.
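For our own rough capacity planning we have been using a simple linear-growth projection (a back-of-the-envelope sketch only — all numbers below are hypothetical, and this says nothing about any actual MongoDB limit):

```python
def weeks_until_limit(current_gb: float, gb_per_dataset: float,
                      datasets_per_week: float, limit_gb: float) -> float:
    """Weeks until the database reaches limit_gb, assuming linear growth."""
    growth_per_week = gb_per_dataset * datasets_per_week
    if growth_per_week <= 0:
        return float("inf")  # no growth: the limit is never reached
    return max(0.0, (limit_gb - current_gb) / growth_per_week)

# Hypothetical numbers: 40 GB today, ~0.1 GB of job metadata per dataset,
# 2 new datasets per week, and a 100 GB storage budget:
print(weeks_until_limit(40, 0.1, 2, 100))  # 300.0
```

So even under pessimistic assumptions the growth looks slow, but we would like to know whether MongoDB itself (or CryoSPARC's use of it) imposes a hard ceiling.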
The output is below.
Sincerely,
Sergei
[pletnevs@ai-hpccryoprd3 bin]$ csprojectid=P1
csjobid=J641
ps -eo user,pid,ppid,start,rsz,vsz,command | grep -e cryosparc_ -e mongo
eval $(cryosparcm env) # load cryosparc environment
./cryosparcm eventlog $csprojectid $csjobid | tail -n 40
./cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run', 'started_at')"
./cryosparcm cli "get_scheduler_targets()"
curl ai-hpccryoprd3.niaid.nih.gov:39002
./cryosparcm log supervisord | tail -n 40
echo $CRYOSPARC_DB_PATH
du -sh $CRYOSPARC_DB_PATH
grep "$(df $CRYOSPARC_DB_PATH | tail -n 1 | awk '{print $NF}') " /proc/mounts
exit # exit the shell after recording outputs
svc_hpc+ 44879 1 Oct 30 25368 41288 python /data/home/svc_hpccryoprd3/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/supervisord -c /data/home/svc_hpccryoprd3/cryosparc_master/supervisord.conf
svc_hpc+ 45192 44879 Oct 30 4149576 5876640 mongod --auth --dbpath /var/lib/CryoSPARCv3 --port 39001 --oplogSize 64 --replSet meteor --wiredTigerCacheSizeGB 4 --bind_ip_all
svc_hpc+ 45303 44879 Oct 30 740132 1752764 python -c import cryosparc_command.command_core as serv; serv.start(port=39002)
svc_hpc+ 45346 44879 Oct 30 304536 1454660 python -c import cryosparc_command.command_vis as serv; serv.start(port=39003)
svc_hpc+ 45350 44879 Oct 30 236816 939720 python -c import cryosparc_command.command_rtp as serv; serv.start(port=39005)
svc_hpc+ 45814 44879 Oct 30 524240 1479708 /data/home/svc_hpccryoprd3/cryosparc_master/cryosparc_app/nodejs/bin/node ./bundle/main.js
svc_hpc+ 341440 45303 10:03:40 227440 567212 python -c import cryosparc_compute.run as run; run.run() --project P267 --job J106 --master_hostname ai-hpccryoprd3.niaid.nih.gov --master_command_core_port 39002
svc_hpc+ 341479 341440 10:03:47 417772 1705928 python -c import cryosparc_compute.run as run; run.run() --project P267 --job J106 --master_hostname ai-hpccryoprd3.niaid.nih.gov --master_command_core_port 39002
svc_hpc+ 500730 45303 17:48:25 224428 567216 python -c import cryosparc_compute.run as run; run.run() --project P267 --job J120 --master_hostname ai-hpccryoprd3.niaid.nih.gov --master_command_core_port 39002
svc_hpc+ 500757 500730 17:48:28 1149020 2440360 python -c import cryosparc_compute.run as run; run.run() --project P267 --job J120 --master_hostname ai-hpccryoprd3.niaid.nih.gov --master_command_core_port 39002
pletnevs 502984 502272 17:57:23 2192 6272 grep --color=auto -e cryosparc_ -e mongo
-bash: cryosparcm: command not found
Viewing Direction Distribution Iteration 006
Posterior Precision Directional Distribution Iteration 006
[CPU RAM used: 46726 MB] Done in 45.903s.
[CPU RAM used: 46726 MB] Outputting files..
[CPU RAM used: 45976 MB] Done in 27.214s.
[CPU RAM used: 45976 MB] Done iteration 6 in 420.968s. Total time so far 2536.106s
[CPU RAM used: 45976 MB] ====== Done Refinement ======
[CPU RAM used: 45976 MB] Note that the output structure from refinement has been
lowpass filtered to the gold-standard FSC resolution estimated
above. However, the structure probably needs sharpening in
order to best visualize high resolution features. This can be
done with the Sharpening task (as a new experiment).
[CPU RAM used: 45976 MB] Full run took 2536.697s
[CPU RAM used: 13203 MB] --------------------------------------------------------------
[CPU RAM used: 13203 MB] Compiling job outputs...
[CPU RAM used: 13204 MB] Passing through outputs for output group particles from input group particles
[CPU RAM used: 13206 MB] This job outputted results ['alignments3D', 'ctf']
[CPU RAM used: 13206 MB] Loaded output dset with 6828 items
[CPU RAM used: 13206 MB] Passthrough results ['blob', 'alignments2D', 'pick_stats', 'location']
[CPU RAM used: 13207 MB] Loaded passthrough dset with 6828 items
[CPU RAM used: 13207 MB] Intersection of output and passthrough has 6828 items
[CPU RAM used: 13207 MB] Checking outputs for output group particles
[CPU RAM used: 13210 MB] Updating job size...
[CPU RAM used: 13209 MB] Exporting job and creating csg files...
[CPU RAM used: 13209 MB] Traceback (most recent call last):
  File "/data/home/svc_hpccryoprd3/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 104, in func
    with make_json_request(self, "/api", data=data, _stacklevel=4) as request:
  File "/data/home/svc_hpccryoprd3/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/data/home/svc_hpccryoprd3/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 225, in make_request
    raise CommandError(error_reason, url=url, code=code, data=resdata)
cryosparc_tools.cryosparc.errors.CommandError: *** (http://ai-hpccryoprd3.niaid.nih.gov:39002/api, code 500) Timeout Error
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 105, in cryosparc_master.cryosparc_compute.run.main
  File "/data/home/svc_hpccryoprd3/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 107, in func
    raise CommandError(
cryosparc_tools.cryosparc.errors.CommandError: *** (http://ai-hpccryoprd3.niaid.nih.gov:39002, code 500) Encounted error from JSONRPC function "dump_job_database" with params {'project_uid': 'P1', 'job_uid': 'J641', 'job_completed': True}
{'_id': '671ff8f69657052c27f1a948', 'errors_run': [{'message': '*** (http://ai-hpccryoprd3.niaid.nih.gov:39002, code 500) Encounted error from JSONRPC function "dump_job_database" with params {\'project_uid\': \'P1\', \'job_uid\': \'J641\', \'job_completed\': True}', 'warning': False}], 'instance_information': {'CUDA_version': '11.8', 'available_memory': '1.40TB', 'cpu_model': 'Intel(R) Xeon(R) Gold 5320 CPU @ 2.20GHz', 'driver_version': '12.3', 'gpu_info': [{'id': 0, 'mem': 50789154816, 'name': 'NVIDIA L40'}], 'ofd_hard_limit': 131072, 'ofd_soft_limit': 1024, 'physical_cores': 52, 'platform_architecture': 'x86_64', 'platform_node': 'ai-hpcgpu32.niaid.nih.gov', 'platform_release': '5.14.0-284.30.1.el9_2.x86_64', 'platform_version': '#1 SMP PREEMPT_DYNAMIC Sat Sep 16 09:55:41 UTC 2023', 'total_memory': '1.48TB', 'used_memory': '72.50GB'}, 'job_type': 'homo_refine_new', 'params_spec': {'refine_symmetry': {'value': 'I'}}, 'project_uid': 'P1', 'started_at': 'Mon, 28 Oct 2024 20:50:56 GMT', 'status': 'failed', 'uid': 'J641', 'version': 'v4.4.1+240110'}
[{'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000000, 'custom_var_names': ['ram_multiplier', 'cpu_multiplier'], 'custom_vars': {}, 'desc': None, 'hostname': 'skyline-gpu', 'lane': 'skyline-gpu', 'name': 'skyline-gpu', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## Available custom variables\n## {{ ram_multiplier }} Multiple the job types ram_gb by this\n## {{ cpu_multiplier }} Multiple the job types num_cpu by this\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name "cryosparc_{{ project_uid }}_{{ job_uid }}_{{ job_creator }}"\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --partition=gpu\n#SBATCH --gres=gpu:{{ [num_gpu, 4] | min }}\n#SBATCH --cpus-per-task={{ [(num_cpu|float * cpu_multiplier|default(1)|float)|int,96] | min }}\n#SBATCH --mem={{ (ram_gb|float * ram_multiplier|default(1)|float)|int }}G\n#SBATCH --output=slurm_logs/%x-{{ cryosparc_username }}-%N-%j-stdout.log\n#SBATCH --error=slurm_logs/%x-{{ cryosparc_username }}-%N-%j-stderr.log\n## cause jobs to launch even in the face of maintenance\n#SBATCH --time-min=1-0\nscontrol show job --json -d ${SLURM_JOB_ID} |jq \'.jobs[].gres_detail\'\n\nsrun {{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'skyline-gpu', 'tpl_vars': ['num_gpu', 'project_uid', 'job_creator', 'project_dir_abs', 'job_uid', 'cryosparc_username', 'ram_multiplier', 'ram_gb', 'job_dir_abs', 'worker_bin_path', 'cluster_job_id', 'run_args', 'job_log_path_abs', 'cpu_multiplier', 'run_cmd', 'num_cpu', 'command'], 'type': 'cluster', 'worker_bin_path': '/data/home/svc_hpccryoprd3/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000000, 'custom_var_names': ['ram_multiplier', 'cpu_multiplier'], 'custom_vars': {}, 'desc': None, 'hostname': 'skyline-gpu_l40', 'lane': 'skyline-gpu_l40', 'name': 'skyline-gpu_l40', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## Available custom variables\n## {{ ram_multiplier }} Multiple the job types ram_gb by this\n## {{ cpu_multiplier }} Multiple the job types num_cpu by this\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name "cryosparc_{{ project_uid }}_{{ job_uid }}_{{ job_creator }}"\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --partition=gpu_l40\n#SBATCH --gres=gpu:{{ [num_gpu, 8] | min }}\n#SBATCH --cpus-per-task={{ [(num_cpu|float * cpu_multiplier|default(1)|float)|int,52] | min }}\n#SBATCH --mem={{ (ram_gb|float * ram_multiplier|default(1)|float)|int }}G\n#SBATCH --output=slurm_logs/%x-{{ cryosparc_username }}-%N-%j-stdout.log\n#SBATCH --error=slurm_logs/%x-{{ cryosparc_username }}-%N-%j-stderr.log\n## cause jobs to launch even in the face of maintenance\n#SBATCH --time-min=1-0\nscontrol show job --json -d ${SLURM_JOB_ID} |jq \'.jobs[].gres_detail\'\n\nsrun {{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'skyline-gpu_l40', 'tpl_vars': ['num_gpu', 'project_uid', 'job_creator', 'project_dir_abs', 'job_uid', 'cryosparc_username', 'ram_multiplier', 'ram_gb', 'job_dir_abs', 'worker_bin_path', 'cluster_job_id', 'run_args', 'job_log_path_abs', 'cpu_multiplier', 'run_cmd', 'num_cpu', 'command'], 'type': 'cluster', 'worker_bin_path': '/data/home/svc_hpccryoprd3/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000000, 'custom_var_names': ['ram_multiplier', 'cpu_multiplier'], 'custom_vars': {}, 'desc': None, 'hostname': 'skyline-all,himem', 'lane': 'skyline-all,himem', 'name': 'skyline-all,himem', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## Available custom variables\n## {{ ram_multiplier }} Multiple the job types ram_gb by this\n## {{ cpu_multiplier }} Multiple the job types num_cpu by this\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name "cryosparc_{{ project_uid }}_{{ job_uid }}_{{ job_creator }}"\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --partition=all,himem\n#SBATCH --cpus-per-task={{ [(num_cpu|float * cpu_multiplier|default(1)|float)|int,64] | min }}\n#SBATCH --mem={{ (ram_gb|float * ram_multiplier|default(1)|float)|int }}G\n#SBATCH --output=slurm_logs/%x-{{ cryosparc_username }}-%N-%j-stdout.log\n#SBATCH --error=slurm_logs/%x-{{ cryosparc_username }}-%N-%j-stderr.log\n## cause jobs to launch even in the face of maintenance\n#SBATCH --time-min=1-0\nscontrol show job --json -d ${SLURM_JOB_ID} |jq \'.jobs[].gres_detail\'\n\nsrun {{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'skyline-all,himem', 'tpl_vars': ['num_gpu', 'project_uid', 'job_creator', 'project_dir_abs', 'job_uid', 'cryosparc_username', 'ram_multiplier', 'ram_gb', 'job_dir_abs', 'worker_bin_path', 'cluster_job_id', 'run_args', 'job_log_path_abs', 'cpu_multiplier', 'run_cmd', 'num_cpu', 'command'], 'type': 'cluster', 'worker_bin_path': '/data/home/svc_hpccryoprd3/cryosparc_worker/bin/cryosparcw'}]
Hello World from cryosparc command core.
2024-10-30 11:00:45,340 INFO spawned: 'database' with pid 8861
2024-10-30 11:00:47,316 INFO success: database entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-10-30 11:00:55,256 INFO spawned: 'command_core' with pid 8974
2024-10-30 11:01:00,263 INFO success: command_core entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2024-10-30 11:01:58,314 INFO spawned: 'command_vis' with pid 9249
2024-10-30 11:01:59,318 INFO success: command_vis entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-10-30 11:02:01,111 INFO spawned: 'command_rtp' with pid 9253
2024-10-30 11:02:02,112 INFO success: command_rtp entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-10-30 11:03:18,285 INFO spawned: 'app' with pid 11451
2024-10-30 11:03:19,288 INFO success: app entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-10-30 11:03:22,337 INFO spawned: 'app_api' with pid 11469
2024-10-30 11:03:23,368 INFO success: app_api entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-10-30 11:56:52,240 INFO waiting for app to stop
2024-10-30 11:56:52,241 INFO waiting for app_api to stop
2024-10-30 11:56:52,242 INFO waiting for command_core to stop
2024-10-30 11:56:52,242 INFO waiting for command_rtp to stop
2024-10-30 11:56:52,242 INFO waiting for command_vis to stop
2024-10-30 11:56:52,242 INFO waiting for database to stop
2024-10-30 11:56:52,319 WARN stopped: app (terminated by SIGTERM)
2024-10-30 11:56:52,319 WARN stopped: app_api (terminated by SIGTERM)
2024-10-30 11:56:52,332 WARN stopped: command_rtp (terminated by SIGQUIT (core dumped))
2024-10-30 11:56:52,653 WARN stopped: command_core (terminated by SIGQUIT (core dumped))
2024-10-30 11:56:52,821 WARN stopped: command_vis (terminated by SIGQUIT (core dumped))
2024-10-30 11:56:53,836 INFO stopped: database (exit status 0)
2024-10-30 11:57:42,845 INFO RPC interface 'supervisor' initialized
2024-10-30 11:57:42,845 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-10-30 11:57:42,846 INFO daemonizing the supervisord process
2024-10-30 11:57:42,861 INFO supervisord started with pid 44879
2024-10-30 11:57:58,582 INFO spawned: 'database' with pid 45192
2024-10-30 11:58:00,476 INFO success: database entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-10-30 11:58:06,818 INFO spawned: 'command_core' with pid 45303
2024-10-30 11:58:11,825 INFO success: command_core entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2024-10-30 11:58:32,037 INFO spawned: 'command_vis' with pid 45346
2024-10-30 11:58:33,038 INFO success: command_vis entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-10-30 11:58:34,412 INFO spawned: 'command_rtp' with pid 45350
2024-10-30 11:58:35,414 INFO success: command_rtp entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-10-30 11:58:55,536 INFO spawned: 'app' with pid 45792
2024-10-30 11:58:56,537 INFO success: app entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-10-30 11:58:58,225 INFO spawned: 'app_api' with pid 45814
2024-10-30 11:58:59,227 INFO success: app_api entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
83K .
tmpfs /run/user/1109906 tmpfs rw,seclabel,nosuid,nodev,relatime,size=6583792k,nr_inodes=1645948,mode=700,uid=1109906,gid=1109906,inode64 0 0
logout
Connection to ai-hpccryoprd3.niaid.nih.gov closed.