Hi all,
I have been trying to run Non-uniform Refinement, and no matter how I set up the job, it fails partway through. Below is the job log from one of the failed runs. I am no expert in CryoSPARC, so any help troubleshooting this, big or small, is very much appreciated.
MAIN PROCESS PID 112094
========= now starting main process at 2025-02-16 22:44:24.762040
refine.newrun cryosparc_compute.jobs.jobregister
MONITOR PROCESS PID 112096
========= monitor process now waiting for main process
========= sending heartbeat at 2025-02-16 22:44:26.640575
========= sending heartbeat at 2025-02-16 22:44:36.654210
***************************************************************
Transparent hugepages setting: [always] madvise never
Running job J92 of type nonuniform_refine_new
Running job on hostname %s a100_72h
Allocated Resources : {'fixed': {'SSD': True}, 'hostname': 'a100_72h', 'lane': 'a100_72h', 'lane_type': 'cluster', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3], 'GPU': [0], 'RAM': [0, 1, 2]}, 'target': {'cache_path': '/cryo/jspangl4/cryosparc/cache', 'cache_quota_mb': 12000000, 'cache_reserve_mb': 80000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'a100_72h', 'lane': 'a100_72h', 'name': 'a100_72h', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n###!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n##\n## What follows is a simple SLURM script:\n\t{%- if num_gpu == 0 %}\n#SBATCH --partition=parallel\n#SBATCH --account=jspangl4\n#SBATCH --qos=normal\n\t{%- else %}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --partition=a100\n#SBATCH --qos=qos_gpu_cryo\n#SBATCH --account=jspangl4_gpu\n\t{%- endif %}\n\n\n\n {%- if num_gpu == 0 %}\n#SBATCH --ntasks-per-node={{ num_cpu }}\n {%- if ram_gb/(4*num_cpu) > 2 %}\n#SBATCH --cpus-per-task=3\n {%- elif ram_gb/(4*num_cpu) > 1 %}\n#SBATCH --cpus-per-task=2\n {%- else %}\n#SBATCH --cpus-per-task=2\n {%- endif %}\n {%- else %}\n#SBATCH --ntasks-per-node={{ num_cpu }}\n## {%- if ram_gb/(4*num_cpu*((12*num_gpu/num_cpu)|int)) > 2 %}\n## #SBATCH --cpus-per-task={{ (36*num_gpu/num_cpu)|int }}\n## {%- elif ram_gb/(4*num_cpu*((12*num_gpu/num_cpu)|int)) > 1 %}\n## #SBATCH --cpus-per-task={{ (24*num_gpu/num_cpu)|int }}\n## {%- else %}\n#SBATCH --cpus-per-task={{ (12*num_gpu/num_cpu)|int }} \n## {%- endif %}\n {%- endif %}\n\n\n\n\n#SBATCH --exclude=/cryo/jspangl4/cryosparc/cluster_configs/.nodes.list\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -o {{ job_dir_abs }}/cryosparc_{{ project_uid }}_{{ job_uid }}.out\n#SBATCH -e {{ job_dir_abs }}/cryosparc_{{ project_uid }}_{{ job_uid }}.err\n#SBATCH --time=72:00:00\n\n## #SBATCH --ntasks-per-node={{ num_gpu * 12}} # each GPU means 12 CPU in A100 queue, and each GPU means 16 CPU in ICA100 queue\n\n\n#module restore\n#module load cuda/11.8.0\numask 0027\n\n\n\navailable_devs=""\nfor devidx in $(seq 0 15);\ndo\n if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z "$available_devs" ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\necho $CUDA_VISIBLE_DEVICES\n{{ run_cmd }}\n\n\n\n\n\n\n', 'send_cmd_tpl': 'ssh devcryo.cm.cluster {{ command 
}}', 'title': 'a100_72h', 'tpl_vars': ['num_gpu', 'project_uid', 'job_log_path_abs', 'job_dir_abs', 'run_cmd', 'command', 'ram_gb', 'worker_bin_path', 'job_uid', 'num_cpu', 'run_args', 'project_dir_abs', 'cluster_job_id'], 'type': 'cluster', 'worker_bin_path': '/cryo/jspangl4/cryosparc/cryosparc_worker/bin/cryosparcw'}}
2025-02-16 22:44:41,094 run_with_executor INFO | Resolving 8117 source path(s) for caching
========= sending heartbeat at 2025-02-16 22:44:46.668412
========= sending heartbeat at 2025-02-16 22:44:56.678330
========= sending heartbeat at 2025-02-16 22:45:06.693349
========= sending heartbeat at 2025-02-16 22:45:16.707995
========= sending heartbeat at 2025-02-16 22:45:26.722962
========= sending heartbeat at 2025-02-16 22:45:36.738337
2025-02-16 22:45:38,774 run_with_executor INFO | Resolved 8117 sources in 57.68 seconds
2025-02-16 22:45:38,791 allocate INFO | Cache allocation start. Active run IDs: P4-J90-1739659731, P5-J2-1739759385, P4-J92-1739781839
2025-02-16 22:45:39,198 refresh INFO | Refreshed cache drive in 0.41 seconds
2025-02-16 22:45:39,718 cleanup_junk_files INFO | Removed 1 invalid item(s) in the cache
2025-02-16 22:45:39,802 refresh INFO | Refreshed cache drive in 0.08 seconds
2025-02-16 22:45:39,813 allocate INFO | Deleted 0 cached files, encountered 0 errors
2025-02-16 22:45:39,813 allocate INFO | Allocated 0 stub cache files; creating links
2025-02-16 22:45:40,490 allocate INFO | Cache allocation complete
2025-02-16 22:45:40,491 run_with_executor INFO | Cache allocation ran in 1.70 seconds
2025-02-16 22:45:40,491 run_with_executor INFO | Found 8117 SSD hit(s)
2025-02-16 22:45:40,491 run_with_executor INFO | Requested files successfully cached to SSD
2025-02-16 22:45:40,900 run_with_executor INFO | SSD cache complete
========= sending heartbeat at 2025-02-16 22:45:46.752398
========= sending heartbeat at 2025-02-16 22:45:56.766626
========= sending heartbeat at 2025-02-16 22:46:06.781204
========= sending heartbeat at 2025-02-16 22:46:16.795327
========= sending heartbeat at 2025-02-16 22:46:26.809683
WARNING: io_uring support disabled (not supported by kernel), I/O performance may degrade
========= sending heartbeat at 2025-02-16 22:46:36.824481
========= sending heartbeat at 2025-02-16 22:46:46.833330
gpufft: creating new cufft plan (plan id 0 pid 112094)
gpu_id 0
ndims 2
dims 640 640 0
inembed 640 640 0
istride 1
idist 409600
onembed 640 640 0
ostride 1
odist 409600
batch 500
type C2C
wkspc automatic
Python traceback:
gpufft: creating new cufft plan (plan id 1 pid 112094)
gpu_id 0
ndims 2
dims 640 640 0
inembed 640 640 0
istride 1
idist 409600
onembed 640 640 0
ostride 1
odist 409600
batch 500
type C2C
wkspc automatic
Python traceback:
HOST ALLOCATION FUNCTION: using numba.cuda.pinned_array
/cryo/jspangl4/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:
kernel(35): warning #68-D: integer conversion resulted in a change of sign
kernel(44): warning #68-D: integer conversion resulted in a change of sign
kernel(17): warning #177-D: variable "N_I" was declared but never referenced
warnings.warn(msg)
========= sending heartbeat at 2025-02-16 22:46:56.838087
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-02-16 22:47:06.852326
========= sending heartbeat at 2025-02-16 22:47:16.866198
========= sending heartbeat at 2025-02-16 22:47:26.873332
========= sending heartbeat at 2025-02-16 22:47:36.887588
========= sending heartbeat at 2025-02-16 22:47:46.902322
========= sending heartbeat at 2025-02-16 22:47:56.916327
========= sending heartbeat at 2025-02-16 22:48:06.924068
/cryo/jspangl4/cryosparc/cryosparc_worker/cryosparc_compute/plotutil.py:571: RuntimeWarning: divide by zero encountered in log
logabs = n.log(n.abs(fM))
========= sending heartbeat at 2025-02-16 22:48:16.938330
========= sending heartbeat at 2025-02-16 22:48:26.951326
gpufft: creating new cufft plan (plan id 2 pid 112094)
gpu_id 0
ndims 3
dims 640 640 640
inembed 640 640 642
istride 1
idist 262963200
onembed 640 640 321
ostride 1
odist 131481600
batch 1
type R2C
wkspc automatic
Python traceback:
========= sending heartbeat at 2025-02-16 22:48:36.965601
========= sending heartbeat at 2025-02-16 22:48:46.979630
gpufft: creating new cufft plan (plan id 3 pid 112094)
gpu_id 0
ndims 2
dims 640 640 0
inembed 640 642 0
istride 1
idist 410880
onembed 640 321 0
ostride 1
odist 205440
batch 500
type R2C
wkspc automatic
Python traceback:
gpufft: creating new cufft plan (plan id 4 pid 112094)
gpu_id 0
ndims 2
dims 640 640 0
inembed 640 642 0
istride 1
idist 410880
onembed 640 321 0
ostride 1
odist 205440
batch 500
type R2C
wkspc automatic
Python traceback:
/cryo/jspangl4/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/dispatcher.py:536: NumbaPerformanceWarning: Grid size 1 will likely result in GPU under-utilization due to low occupancy.
warn(NumbaPerformanceWarning(msg))
========= sending heartbeat at 2025-02-16 22:48:56.986732
========= sending heartbeat at 2025-02-16 22:49:06.993324
========= sending heartbeat at 2025-02-16 22:49:17.007527
========= sending heartbeat at 2025-02-16 22:49:27.021326
========= sending heartbeat at 2025-02-16 22:49:37.033798
========= sending heartbeat at 2025-02-16 22:49:47.047854
========= sending heartbeat at 2025-02-16 22:49:57.062213
========= sending heartbeat at 2025-02-16 22:50:07.076322
========= sending heartbeat at 2025-02-16 22:50:17.090810
========= sending heartbeat at 2025-02-16 22:50:27.104926
========= sending heartbeat at 2025-02-16 22:50:37.119444
gpufft: creating new cufft plan (plan id 5 pid 112094)
gpu_id 0
ndims 2
dims 640 640 0
inembed 640 642 0
istride 1
idist 410880
onembed 640 321 0
ostride 1
odist 205440
batch 218
type R2C
wkspc automatic
Python traceback:
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-02-16 22:50:47.134017
========= sending heartbeat at 2025-02-16 22:50:57.148924
========= sending heartbeat at 2025-02-16 22:51:07.163325
========= sending heartbeat at 2025-02-16 22:51:17.176325
========= sending heartbeat at 2025-02-16 22:51:27.190332
========= sending heartbeat at 2025-02-16 22:51:37.203325
========= sending heartbeat at 2025-02-16 22:51:47.218324
========= sending heartbeat at 2025-02-16 22:51:57.232324
========= sending heartbeat at 2025-02-16 22:52:07.245805
========= sending heartbeat at 2025-02-16 22:52:17.260049
========= sending heartbeat at 2025-02-16 22:52:27.267338
========= sending heartbeat at 2025-02-16 22:52:37.282036
========= sending heartbeat at 2025-02-16 22:52:47.296313
gpufft: creating new cufft plan (plan id 6 pid 112094)
gpu_id 0
ndims 2
dims 640 640 0
inembed 640 642 0
istride 1
idist 410880
onembed 640 321 0
ostride 1
odist 205440
batch 219
type R2C
wkspc automatic
Python traceback:
========= sending heartbeat at 2025-02-16 22:52:57.309803
========= sending heartbeat at 2025-02-16 22:53:07.323322
========= sending heartbeat at 2025-02-16 22:53:17.337921
========= sending heartbeat at 2025-02-16 22:53:27.352129
========= sending heartbeat at 2025-02-16 22:53:37.366449
========= sending heartbeat at 2025-02-16 22:53:47.381188
========= sending heartbeat at 2025-02-16 22:53:57.395827
========= sending heartbeat at 2025-02-16 22:54:07.410808
========= sending heartbeat at 2025-02-16 22:54:17.424840
========= sending heartbeat at 2025-02-16 22:54:27.439176
========= sending heartbeat at 2025-02-16 22:54:37.453322
========= sending heartbeat at 2025-02-16 22:54:47.468237
========= sending heartbeat at 2025-02-16 22:54:57.483323
========= sending heartbeat at 2025-02-16 22:55:07.497566
========= sending heartbeat at 2025-02-16 22:55:17.512436
========= sending heartbeat at 2025-02-16 22:55:27.526681
========= sending heartbeat at 2025-02-16 22:55:37.541327
========= sending heartbeat at 2025-02-16 22:55:47.556377
========= sending heartbeat at 2025-02-16 22:55:57.570968
========= sending heartbeat at 2025-02-16 22:56:07.585579
========= sending heartbeat at 2025-02-16 22:56:17.594460
========= sending heartbeat at 2025-02-16 22:56:27.606836
========= sending heartbeat at 2025-02-16 22:56:37.621841
========= sending heartbeat at 2025-02-16 22:56:47.635323
========= sending heartbeat at 2025-02-16 22:56:57.649322
========= sending heartbeat at 2025-02-16 22:57:07.663324
========= sending heartbeat at 2025-02-16 22:57:17.677698
========= sending heartbeat at 2025-02-16 22:57:27.691535
========= sending heartbeat at 2025-02-16 22:57:37.705546
========= sending heartbeat at 2025-02-16 22:57:47.719551
========= sending heartbeat at 2025-02-16 22:57:57.733558
gpufft: creating new cufft plan (plan id 7 pid 112094)
gpu_id 0
ndims 3
dims 640 640 640
inembed 640 640 642
istride 1
idist 262963200
onembed 640 640 321
ostride 1
odist 131481600
batch 1
type R2C
wkspc manual
Python traceback:
gpufft: creating new cufft plan (plan id 8 pid 112094)
gpu_id 0
ndims 3
dims 320 320 320
inembed 320 320 161
istride 1
idist 16486400
onembed 320 320 322
ostride 1
odist 32972800
batch 1
type C2R
wkspc manual
Python traceback:
========= sending heartbeat at 2025-02-16 22:58:07.747324
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-02-16 22:58:17.760328
<string>:1: RuntimeWarning: invalid value encountered in true_divide
========= sending heartbeat at 2025-02-16 22:58:27.774427
========= sending heartbeat at 2025-02-16 22:58:37.789205
========= sending heartbeat at 2025-02-16 22:58:47.803987
========= sending heartbeat at 2025-02-16 22:58:57.818990
gpufft: creating new cufft plan (plan id 9 pid 112094)
gpu_id 0
ndims 3
dims 640 640 640
inembed 640 640 321
istride 1
idist 131481600
onembed 640 640 642
ostride 1
odist 262963200
batch 1
type C2R
wkspc manual
Python traceback:
========= sending heartbeat at 2025-02-16 22:59:07.833975
/cryo/jspangl4/cryosparc/cryosparc_worker/bin/cryosparcw: line 153: 112094 Killed python -c "import cryosparc_compute.run as run; run.run()" "$@"