CryoSPARC Live jobs hanging

Following an update to CryoSPARC 4.6.0, some users have experienced CryoSPARC Live jobs hanging.
CryoSPARC runs on a cluster, with a separate instance in a Kubernetes pod for each research group. SSD caching is carried out on a common BeeGFS SSD pool, in case that is relevant. This setup was working prior to the update.
This happened to a user this evening and has left the interface unresponsive. The job seems to have stalled during extraction, reporting the following text. Any advice would be much appreciated:

[CPU: 3.04 GB]

— Transition from ctf to pick done in 0.31s

[CPU: 3.04 GB]

Extracting Blob Picked Particles…
[CPU: 3.04 GB]

Extracting from S16/motioncorrected/FoilHole_16399986_Data_16391083_21_20241016_162239_Fractions_patch_aligned_doseweighted.mrc : 618 particles (97 rejected near edges)
[CPU: 3.21 GB]

Extracted particles in 0.90s
[CPU: 3.21 GB]

Wrote out extracted particles in 0.24s
[CPU: 3.04 GB]

— Transition from pick to extract done in 1.21s

[CPU: 3.04 GB]

Extracting Manual Picked Particles…
[CPU: 3.04 GB]

— Transition from extract to extract_manual done in 0.03s

[CPU: 3.04 GB]

Applying thresholds…
[CPU: 3.04 GB]

— Transition from extract_manual to ready done in 0.03s

[CPU: 1.70 GB]

Traceback (most recent call last):
  File "/mnt/beegfs/software/structural_biology/release/cryosparc/ahel/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 105, in func
    with make_json_request(self, "/api", data=data, _stacklevel=4) as request:
  File "/mnt/beegfs/software/structural_biology/release/cryosparc/ahel/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/mnt/beegfs/software/structural_biology/release/cryosparc/ahel/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 226, in make_request
    raise CommandError(error_reason, url=url, code=code, data=resdata)
cryosparc_tools.cryosparc.errors.CommandError: *** (http://cryosparc.cosmic:30060/api, code 500) Timeout Error

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/jobs/rtp_workers/run.py", line 383, in cryosparc_master.cryosparc_compute.jobs.rtp_workers.run.rtp_worker
  File "/mnt/beegfs/software/structural_biology/release/cryosparc/ahel/cryosparc/cryosparc_worker/cryosparc_compute/jobs/rtp_workers/rtp_common.py", line 46, in update_exposure
    get_rtp_cli().update_exposure_property(project_uid, session_uid, exposure_uid, attrs, operation)
  File "/mnt/beegfs/software/structural_biology/release/cryosparc/ahel/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 108, in func
    raise CommandError(
cryosparc_tools.cryosparc.errors.CommandError: *** (http://cryosparc.cosmic:30060, code 500) Encounted error from JSONRPC function "update_exposure_property" with params ('P95', 'S16', 488, {'in_progress': False}, '$set')
[CPU: 1.70 GB]

Traceback (most recent call last):
  File "/mnt/beegfs/software/structural_biology/release/cryosparc/ahel/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 105, in func
    with make_json_request(self, "/api", data=data, _stacklevel=4) as request:
  File "/mnt/beegfs/software/structural_biology/release/cryosparc/ahel/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/mnt/beegfs/software/structural_biology/release/cryosparc/ahel/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 226, in make_request
    raise CommandError(error_reason, url=url, code=code, data=resdata)
cryosparc_tools.cryosparc.errors.CommandError: *** (http://cryosparc.cosmic:30060/api, code 500) Timeout Error

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/jobs/rtp_workers/run.py", line 383, in cryosparc_master.cryosparc_compute.jobs.rtp_workers.run.rtp_worker
  File "/mnt/beegfs/software/structural_biology/release/cryosparc/ahel/cryosparc/cryosparc_worker/cryosparc_compute/jobs/rtp_workers/rtp_common.py", line 46, in update_exposure
    get_rtp_cli().update_exposure_property(project_uid, session_uid, exposure_uid, attrs, operation)
  File "/mnt/beegfs/software/structural_biology/release/cryosparc/ahel/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 108, in func
    raise CommandError(
cryosparc_tools.cryosparc.errors.CommandError: *** (http://cryosparc.cosmic:30060, code 500) Encounted error from JSONRPC function "update_exposure_property" with params ('P95', 'S16', 488, {'in_progress': False}, '$set')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/beegfs/software/structural_biology/release/cryosparc/ahel/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 105, in func
    with make_json_request(self, "/api", data=data, _stacklevel=4) as request:
  File "/mnt/beegfs/software/structural_biology/release/cryosparc/ahel/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/mnt/beegfs/software/structural_biology/release/cryosparc/ahel/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 226, in make_request
    raise CommandError(error_reason, url=url, code=code, data=resdata)
cryosparc_tools.cryosparc.errors.CommandError: *** (http://cryosparc.cosmic:30060/api, code 500) Timeout Error

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 116, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/rtp_workers/run.py", line 390, in cryosparc_master.cryosparc_compute.jobs.rtp_workers.run.rtp_worker
  File "/mnt/beegfs/software/structural_biology/release/cryosparc/ahel/cryosparc/cryosparc_worker/cryosparc_compute/jobs/rtp_workers/rtp_common.py", line 46, in update_exposure
    get_rtp_cli().update_exposure_property(project_uid, session_uid, exposure_uid, attrs, operation)
  File "/mnt/beegfs/software/structural_biology/release/cryosparc/ahel/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 108, in func
    raise CommandError(
cryosparc_tools.cryosparc.errors.CommandError: *** (http://cryosparc.cosmic:30060, code 500) Encounted error from JSONRPC function "update_exposure_property" with params ('P95', 'S16', 488, {'fail_count': 1}, '$inc')
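
In each chain the failing call is the worker's JSON-RPC request to the master's RTP command service at http://cryosparc.cosmic:30060, which times out. As a rough check of whether that endpoint answers at all from a worker node, a minimal probe along the lines below could be used (a sketch using the requests package, not part of the CryoSPARC tooling; it does not authenticate, so any HTTP reply received within the timeout simply means the service is up, while a hang reproduces the symptom in the traceback):

import time
import requests  # assumed available; any HTTP client would do

URL = "http://cryosparc.cosmic:30060/api"  # endpoint taken from the traceback above

t0 = time.time()
try:
    # An empty POST is enough to see whether the service responds at all.
    r = requests.post(URL, json={}, timeout=30)
    print(f"responded in {time.time() - t0:.1f}s with HTTP {r.status_code}")
except requests.exceptions.Timeout:
    print("no response within 30s (same symptom as the worker timeout)")
except requests.exceptions.ConnectionError as e:
    print(f"connection failed after {time.time() - t0:.1f}s: {e}")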

@EdLowe May I ask for some additional details?

  1. What are the resource limits (RAM, CPU cores) imposed on
    • the CryoSPARC master container
    • the CryoSPARC worker hosts(?)/containers(?)
  2. Are the CryoSPARC workers also implemented as containers?
  3. What are the outputs of the following commands (run on the relevant CryoSPARC master)? A cryosparc-tools equivalent is sketched after this list, in case that is more convenient.
    cryosparcm rtpcli "get_exposure('P95', 'S16', 488)"
    cryosparcm cli "get_scheduler_targets()"
  4. Did the container(s) for the CryoSPARC master processes run continuously?
  5. Could you please email us the tgz file created by running the command
    cryosparcm snaplogs? I will send you a direct message with the email address.
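
If it is easier to collect these from a Python session, the same queries can be issued through cryosparc-tools. The sketch below uses the standard CryoSPARC client constructor with placeholder credentials, host, and port (adjust for your instance), and assumes the client's cli and rtp attributes dispatch to command_core and command_rtp respectively:

# Hedged sketch: the two queries above via cryosparc-tools.
# All connection parameters are placeholders; cs.rtp is assumed to dispatch to
# the command_rtp service, cs.cli to command_core.
from cryosparc.tools import CryoSPARC

cs = CryoSPARC(
    license="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",  # placeholder licence ID
    host="cryosparc.cosmic",                         # master hostname
    base_port=39000,                                 # adjust to your instance
    email="user@example.com",                        # placeholder
    password="...",                                  # placeholder
)

print(cs.cli.get_scheduler_targets())          # same as: cryosparcm cli "get_scheduler_targets()"
print(cs.rtp.get_exposure("P95", "S16", 488))  # same as: cryosparcm rtpcli "get_exposure(...)"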

Thanks for your reply:
1:
Master (container): RAM 16 GB, CPU 4
Worker: RAM 120 GB, CPU 40 (the queue does not limit CPU access)

2: No, the workers are not implemented as containers. The worker is installed on a shared BeeGFS filesystem and executed on the cluster GPU nodes.

3:

ahel_cryosparc@cryosparc:/mnt/beegfs/software/structural_biology/release/cryosparc/ahel/cryosparc/cryosparc_master/bin$ ./cryosparcm rtpcli "get_exposure('P95', 'S16', 488)"
{'abs_file_path': '/mnt/beegfs/cosmic/data/arctica1/epu-20241016_8-3-db121024/movies/FoilHole_16399986_Data_16391083_21_20241016_162239_Fractions.mrc', 'attributes': {'astigmatism': 175.16291989132878, 'astigmatism_angle': -35.53548352548414, 'average_defocus': 16492.180177315055, 'blob_pick_score_median': 0.30580076575279236, 'check_at': 1729095667.7847838, 'ctf_at': 1729095674.5045793, 'ctf_fit_to_A': 3.721162320319277, 'defocus_range': 1146.5070838710326, 'df_tilt_angle': 10.817474835966932, 'extract_at': 1729095676.0119655, 'found_at': 1729095667.758464, 'ice_thickness_rel': 0.979425608741885, 'manual_extract_at': 1729095676.0572824, 'max_intra_frame_motion': 0.2824234546648528, 'motion_at': 1729095670.5603945, 'phase_shift': 0.0, 'pick_at': 1729095674.8073857, 'ready_at': 1729095676.083776, 'template_pick_score_median': 0, 'thumbs_at': 1729095670.798946, 'total_blob_picks': 715, 'total_extracted_particles': 618, 'total_extracted_particles_blob': 618, 'total_extracted_particles_manual': 0, 'total_extracted_particles_template': 0, 'total_manual_picks': 0, 'total_motion_dist': 3.3280183034117985, 'total_template_picks': 0}, 'deleted': False, 'discovered_at': 1729095658.5216014, 'exp_group_id': 1, 'fail_count': 0, 'fail_reason': '', 'failed': False, 'groups': {'exposure': {'background_blob': {'binfactor': [4], 'idx': [0], 'path': ['S16/motioncorrected/FoilHole_16399986_Data_16391083_21_20241016_162239_Fractions_background.mrc'], 'psize_A': [0.949999988079071], 'shape': [[4096, 4096]]}, 'ctf': {'accel_kv': [200.0], 'amp_contrast': [0.10000000149011612], 'cross_corr_ctffind4': [0.0], 'cs_mm': [2.700000047683716], 'ctf_fit_to_A': [3.7211623191833496], 'df1_A': [16579.76171875], 'df2_A': [16404.599609375], 'df_angle_rad': [-0.6202111840248108], 'exp_group_id': [1], 'fig_of_merit_gctf': [0.0], 'path': ['S16/ctfestimated/FoilHole_16399986_Data_16391083_21_20241016_162239_Fractions_patch_aligned_ctf_spline.npy'], 'phase_shift_rad': [0.0], 'type': ['spline']}, 'ctf_stats': {'cross_corr': [0.0], 'ctf_fit_to_A': [3.7211623191833496], 'df_range': [[16238.9775390625, 17385.484375]], 'df_tilt_normal': [[-0.040270812809467316, 0.18678441643714905]], 'diag_image_path': ['S16/ctfestimated/FoilHole_16399986_Data_16391083_21_20241016_162239_Fractions_patch_aligned_ctf_diag_2D.mrc'], 'fit_data_path': ['S16/ctfestimated/FoilHole_16399986_Data_16391083_21_20241016_162239_Fractions_patch_aligned_ctf_diag_plt.npy'], 'ice_thickness_rel': [0.9794256091117859], 'spectrum_dim': [1152], 'type': ['spline']}, 'gain_ref_blob': {'flip_x': [0], 'flip_y': [0], 'idx': [0], 'path': [''], 'rotate_num': [0], 'shape': [[0, 0]]}, 'micrograph_blob': {'format': ['MRC/2'], 'idx': [0], 'import_sig': ['0'], 'is_background_subtracted': [1], 'path': ['S16/motioncorrected/FoilHole_16399986_Data_16391083_21_20241016_162239_Fractions_patch_aligned_doseweighted.mrc'], 'psize_A': [0.949999988079071], 'shape': [[4096, 4096]], 'vmax': [267505.46875], 'vmin': [-264593.15625]}, 'micrograph_blob_non_dw': {'format': ['MRC/2'], 'idx': [0], 'import_sig': ['0'], 'is_background_subtracted': [1], 'path': ['S16/motioncorrected/FoilHole_16399986_Data_16391083_21_20241016_162239_Fractions_patch_aligned.mrc'], 'psize_A': [0.949999988079071], 'shape': [[4096, 4096]], 'vmax': [267505.46875], 'vmin': [-264593.15625]}, 'micrograph_blob_thumb': {'format': ['MRC/2'], 'idx': [0], 'import_sig': ['0'], 'is_background_subtracted': [1], 'path': ['S16/motioncorrected/FoilHole_16399986_Data_16391083_21_20241016_162239_Fractions_patch_aligned_thumb.mrc'], 'psize_A': 
[0.949999988079071], 'shape': [[1024, 1024]], 'vmax': [-264593.15625], 'vmin': [-264593.15625]}, 'movie_blob': {'format': ['MRC/1'], 'has_defect_file': [0], 'import_sig': ['1022148816241583308'], 'is_gain_corrected': [1], 'path': ['S16/import_movies/FoilHole_16399986_Data_16391083_21_20241016_162239_Fractions.mrc'], 'psize_A': [0.949999988079071], 'shape': [[36, 4096, 4096]]}, 'mscope_params': {'accel_kv': [200.0], 'beam_shift': [[0.0, 0.0]], 'beam_shift_known': [0], 'cs_mm': [2.700000047683716], 'defect_path': [''], 'exp_group_id': [1], 'neg_stain': [0], 'phase_plate': [0], 'total_dose_e_per_A2': [40.0]}, 'rigid_motion': {'frame_end': [36], 'frame_start': [0], 'idx': [0], 'path': ['S16/motioncorrected/FoilHole_16399986_Data_16391083_21_20241016_162239_Fractions_traj.npy'], 'psize_A': [0.949999988079071], 'type': ['rigid'], 'zero_shift_frame': [0]}, 'spline_motion': {'frame_end': [36], 'frame_start': [0], 'idx': [0], 'path': ['S16/motioncorrected/FoilHole_16399986_Data_16391083_21_20241016_162239_Fractions_bending_traj.npy'], 'psize_A': [0.949999988079071], 'type': ['spline'], 'zero_shift_frame': [0]}, 'uid': ['8876671285173552504']}, 'particle_blob': {'count': 715, 'fields': ['uid', 'location/micrograph_uid', 'location/exp_group_id', 'location/micrograph_path', 'location/micrograph_shape', 'location/micrograph_psize_A', 'location/center_x_frac', 'location/center_y_frac', 'pick_stats/ncc_score', 'pick_stats/power', 'pick_stats/template_idx', 'pick_stats/angle_rad', 'location/min_dist_A'], 'path': 'S16/pick/FoilHole_16399986_Data_16391083_21_20241016_162239_Fractions_patch_aligned_doseweighted_particle_blob.cs'}, 'particle_deep': {}, 'particle_extracted': [{'count': 618, 'fields': ['uid', 'pick_stats/ncc_score', 'pick_stats/power', 'pick_stats/template_idx', 'pick_stats/angle_rad', 'location/min_dist_A', 'blob/path', 'blob/idx', 'blob/shape', 'blob/psize_A', 'blob/sign', 'blob/import_sig', 'location/micrograph_uid', 'location/exp_group_id', 'location/micrograph_path', 'location/micrograph_shape', 'location/micrograph_psize_A', 'location/center_x_frac', 'location/center_y_frac', 'ctf/type', 'ctf/exp_group_id', 'ctf/accel_kv', 'ctf/cs_mm', 'ctf/amp_contrast', 'ctf/df1_A', 'ctf/df2_A', 'ctf/df_angle_rad', 'ctf/phase_shift_rad', 'ctf/scale', 'ctf/scale_const', 'ctf/shift_A', 'ctf/tilt_A', 'ctf/trefoil_A', 'ctf/tetra_A', 'ctf/anisomag', 'ctf/bfactor'], 'output_shape': 270, 'path': 'S16/extract/FoilHole_16399986_Data_16391083_21_20241016_162239_Fractions_patch_aligned_doseweighted_particle_blob_extracted_270.cs', 'picker_type': 'blob'}], 'particle_manual': {}, 'particle_manual_extracted': [{'count': 0, 'fields': ['uid'], 'output_shape': 270, 'path': 'S16/extract/FoilHole_16399986_Data_16391083_21_20241016_162239_Fractions_patch_aligned_doseweighted_particle_manual_extracted_270.cs', 'picker_type': 'manual'}], 'particle_template': {}}, 'in_progress': False, 'manual_reject': False, 'micrograph_psize': 0.949999988079071, 'micrograph_shape': [4096, 4096], 'parameter_version': 1, 'picker_type': 'blob', 'preview_img_1x': ['670fe7f6036a5e39ed9f8452', '670fe7f6036a5e39ed9f8454', '670fe7f6036a5e39ed9f8456'], 'preview_img_2x': ['670fe7f6036a5e39ed9f8458', '670fe7f6036a5e39ed9f845a', '670fe7f6036a5e39ed9f845c'], 'priority': 0, 'project_uid': 'P95', 'session_uid': 'S16', 'size': 1207996416, 'stage': 'ready', 'test': False, 'threshold_reject': False, 'thumb_shape': [1024, 1024], 'uid': 488, 'worker_juid': None}
ahel_cryosparc@cryosparc:/mnt/beegfs/software/structural_biology/release/cryosparc/ahel/cryosparc/cryosparc_master/bin$ ./cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/mnt/beegfs/fast_cache/ahel', 'cache_quota_mb': 8000000, 'cache_reserve_mb': 8000000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'cosmic gpu', 'lane': 'cosmic gpu', 'name': 'cosmic gpu', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. \n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -c {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH -p gpu\n#SBATCH --mem={{ (ram_gb*2000)|int }}MB             \n#SBATCH -o {{ job_dir_abs }}/slurm.out\n#SBATCH -e {{ job_dir_abs }}/slurm.err\n\nexport PYCUDA_CACHE_DIR=/tmp/cudacache_{{ job_uid }}_$$\nmkdir -p $PYCUDA_CACHE_DIR\n\nexport LD_LIBRARY_PATH=/mnt/service/software/packages/gcc/gcc-5.5.0/lib/x86_64-linux-gnu/:/mnt/service/software/packages/gcc/gcc-5.5.0/lib:/mnt/service/software/packages/gcc/gcc-10.3.0/lib64:/mnt/service/software/lib\n\n{{ run_cmd }}\n\nrm -r $PYCUDA_CACHE_DIR\n', 'send_cmd_tpl': '{{ command }}', 'title': 'cosmic gpu', 'tpl_vars': ['job_uid', 'run_cmd', 'ram_gb', 'num_cpu', 'project_uid', 'cluster_job_id', 'job_dir_abs', 'num_gpu', 'project_dir_abs', 'cryosparc_username', 'job_creator', 'run_args', 'command', 'job_log_path_abs', 'worker_bin_path'], 'type': 'cluster', 'worker_bin_path': '/mnt/beegfs/software/structural_biology/release/cryosparc/ahel/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/mnt/beegfs/fast_cache/ahel', 'cache_quota_mb': 8000000, 'cache_reserve_mb': 8000000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'cosmic gpu large', 'lane': 'cosmic gpu large', 'name': 'cosmic gpu large', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }}            - the complete command string 
to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. \n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -c {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH -p gpu_large\n#SBATCH --mem=120000\n#SBATCH -o {{ job_dir_abs }}/slurm.out\n#SBATCH -e {{ job_dir_abs }}/slurm.err\n\nexport PYCUDA_CACHE_DIR=/tmp/cudacache_{{ job_uid }}_$$\nmkdir -p $PYCUDA_CACHE_DIR\n\nexport LD_LIBRARY_PATH=/mnt/service/software/packages/gcc/gcc-5.5.0/lib/x86_64-linux-gnu/:/mnt/service/software/packages/gcc/gcc-5.5.0/lib:/mnt/service/software/packages/gcc/gcc-10.3.0/lib64:/mnt/service/software/lib\n\n{{ run_cmd }}\n\nrm -r $PYCUDA_CACHE_DIR\n', 'send_cmd_tpl': '{{ command }}', 'title': 'cosmic gpu large', 'tpl_vars': ['job_uid', 'run_cmd', 'ram_gb', 'num_cpu', 'project_uid', 'cluster_job_id', 'job_dir_abs', 'num_gpu', 'project_dir_abs', 'cryosparc_username', 'job_creator', 'run_args', 'command', 'job_log_path_abs', 'worker_bin_path'], 'type': 'cluster', 'worker_bin_path': '/mnt/beegfs/software/structural_biology/release/cryosparc/ahel/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'cosmic gpu small', 'lane': 'cosmic gpu small', 'name': 'cosmic gpu small', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. 
\n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -c {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH -p gpu_small\n#SBATCH --mem={{ (num_gpu*60000)|int }}MB           \n#SBATCH -o {{ job_dir_abs }}/slurm.out\n#SBATCH -e {{ job_dir_abs }}/slurm.err\n\nexport PYCUDA_CACHE_DIR=/tmp/cudacache_{{ job_uid }}_$$\nmkdir -p $PYCUDA_CACHE_DIR\n\nexport LD_LIBRARY_PATH=/mnt/service/software/packages/gcc/gcc-5.5.0/lib/x86_64-linux-gnu/:/mnt/service/software/packages/gcc/gcc-5.5.0/lib:/mnt/service/software/packages/gcc/gcc-10.3.0/lib64:/mnt/service/software/packages/cuda/cuda-11.4.1/lib64:/mnt/service/software/lib\n\n{{ run_cmd }}\n\nrm -r $PYCUDA_CACHE_DIR\n', 'send_cmd_tpl': '{{ command }}', 'title': 'cosmic gpu small', 'tpl_vars': ['job_uid', 'run_cmd', 'ram_gb', 'num_cpu', 'project_uid', 'cluster_job_id', 'job_dir_abs', 'num_gpu', 'project_dir_abs', 'cryosparc_username', 'job_creator', 'run_args', 'command', 'job_log_path_abs', 'worker_bin_path'], 'type': 'cluster', 'worker_bin_path': '/mnt/beegfs/software/structural_biology/release/cryosparc/ahel/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'cosmic gpu verylarge', 'lane': 'cosmic gpu verylarge', 'name': 'cosmic gpu verylarge', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. 
\n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -c {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH -p gpu_large\n#SBATCH --mem=240000\n#SBATCH --constraint=cryosparc\n#SBATCH -o {{ job_dir_abs }}/slurm.out\n#SBATCH -e {{ job_dir_abs }}/slurm.err\n\nexport PYCUDA_CACHE_DIR=/tmp/cudacache_{{ job_uid }}_$$\nmkdir -p $PYCUDA_CACHE_DIR\n\nexport LD_LIBRARY_PATH=/mnt/service/software/packages/gcc/gcc-5.5.0/lib/x86_64-linux-gnu/:/mnt/service/software/packages/gcc/gcc-5.5.0/lib:/mnt/service/software/packages/gcc/gcc-10.3.0/lib64:/mnt/service/software/packages/cuda/cuda-11.4.1/lib64:/mnt/service/software/lib\n\n{{ run_cmd }}\n\nrm -r $PYCUDA_CACHE_DIR\n', 'send_cmd_tpl': '{{ command }}', 'title': 'cosmic gpu verylarge', 'tpl_vars': ['job_uid', 'run_cmd', 'ram_gb', 'num_cpu', 'project_uid', 'cluster_job_id', 'job_dir_abs', 'num_gpu', 'project_dir_abs', 'cryosparc_username', 'job_creator', 'run_args', 'command', 'job_log_path_abs', 'worker_bin_path'], 'type': 'cluster', 'worker_bin_path': '/mnt/beegfs/software/structural_biology/release/cryosparc/ahel/cryosparc/cryosparc_worker/bin/cryosparcw'}]
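
For reference, the script_tpl fields above are Jinja templates that the master renders before submitting with sbatch. The short sketch below (using the jinja2 package, an assumption on my part, with a made-up job size) only illustrates how the memory line in the "cosmic gpu" lane expands, i.e. that lane requests roughly twice the job's estimated RAM:

# Illustrative only: render the memory line from the "cosmic gpu" lane's
# script_tpl with Jinja2 (assumed available) and a made-up ram_gb value.
from jinja2 import Template

mem_line = Template("#SBATCH --mem={{ (ram_gb*2000)|int }}MB")
print(mem_line.render(ram_gb=24))  # -> #SBATCH --mem=48000MB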

4: Yes, the master container ran continuously, although it has since been restarted.

5: I have sent the .tgz file as requested.

Thanks again.