Hi all,
I have been trying to run Non-uniform Refinement, and no matter how I set up the job, it fails partway through. Below is the job log from one of the failed runs. I am no expert in CryoSPARC, so any help troubleshooting this, big or small, is very much appreciated.
MAIN PROCESS PID 112094
========= now starting main process at 2025-02-16 22:44:24.762040
refine.newrun cryosparc_compute.jobs.jobregister
MONITOR PROCESS PID 112096
========= monitor process now waiting for main process
========= sending heartbeat at 2025-02-16 22:44:26.640575
========= sending heartbeat at 2025-02-16 22:44:36.654210
***************************************************************
Transparent hugepages setting: [always] madvise never
Running job J92 of type nonuniform_refine_new
Running job on hostname %s a100_72h
Allocated Resources : {'fixed': {'SSD': True}, 'hostname': 'a100_72h', 'lane': 'a100_72h', 'lane_type': 'cluster', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3], 'GPU': [0], 'RAM': [0, 1, 2]}, 'target': {'cache_path': '/cryo/jspangl4/cryosparc/cache', 'cache_quota_mb': 12000000, 'cache_reserve_mb': 80000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'a100_72h', 'lane': 'a100_72h', 'name': 'a100_72h', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n###!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n##\n## What follows is a simple SLURM script:\n\t{%- if num_gpu == 0 %}\n#SBATCH --partition=parallel\n#SBATCH --account=jspangl4\n#SBATCH --qos=normal\n\t{%- else %}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --partition=a100\n#SBATCH --qos=qos_gpu_cryo\n#SBATCH --account=jspangl4_gpu\n\t{%- endif %}\n\n\n\n {%- if num_gpu == 0 %}\n#SBATCH --ntasks-per-node={{ num_cpu }}\n {%- if ram_gb/(4*num_cpu) > 2 %}\n#SBATCH --cpus-per-task=3\n {%- elif ram_gb/(4*num_cpu) > 1 %}\n#SBATCH --cpus-per-task=2\n {%- else %}\n#SBATCH --cpus-per-task=2\n {%- endif %}\n {%- else %}\n#SBATCH --ntasks-per-node={{ num_cpu }}\n## {%- if ram_gb/(4*num_cpu*((12*num_gpu/num_cpu)|int)) > 2 %}\n## #SBATCH --cpus-per-task={{ (36*num_gpu/num_cpu)|int }}\n## {%- elif ram_gb/(4*num_cpu*((12*num_gpu/num_cpu)|int)) > 1 %}\n## #SBATCH --cpus-per-task={{ (24*num_gpu/num_cpu)|int }}\n## {%- else %}\n#SBATCH --cpus-per-task={{ (12*num_gpu/num_cpu)|int }} \n## {%- endif %}\n {%- endif %}\n\n\n\n\n#SBATCH --exclude=/cryo/jspangl4/cryosparc/cluster_configs/.nodes.list\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -o {{ job_dir_abs }}/cryosparc_{{ project_uid }}_{{ job_uid }}.out\n#SBATCH -e {{ job_dir_abs }}/cryosparc_{{ project_uid }}_{{ job_uid }}.err\n#SBATCH --time=72:00:00\n\n## #SBATCH --ntasks-per-node={{ num_gpu * 12}} # each GPU means 12 CPU in A100 queue, and each GPU means 16 CPU in ICA100 queue\n\n\n#module restore\n#module load cuda/11.8.0\numask 0027\n\n\n\navailable_devs=""\nfor devidx in $(seq 0 15);\ndo\n if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z "$available_devs" ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\necho $CUDA_VISIBLE_DEVICES\n{{ run_cmd }}\n\n\n\n\n\n\n', 'send_cmd_tpl': 'ssh devcryo.cm.cluster {{ command 
}}', 'title': 'a100_72h', 'tpl_vars': ['num_gpu', 'project_uid', 'job_log_path_abs', 'job_dir_abs', 'run_cmd', 'command', 'ram_gb', 'worker_bin_path', 'job_uid', 'num_cpu', 'run_args', 'project_dir_abs', 'cluster_job_id'], 'type': 'cluster', 'worker_bin_path': '/cryo/jspangl4/cryosparc/cryosparc_worker/bin/cryosparcw'}}
2025-02-16 22:44:41,094 run_with_executor INFO | Resolving 8117 source path(s) for caching
========= sending heartbeat at 2025-02-16 22:44:46.668412
========= sending heartbeat at 2025-02-16 22:44:56.678330
========= sending heartbeat at 2025-02-16 22:45:06.693349
========= sending heartbeat at 2025-02-16 22:45:16.707995
========= sending heartbeat at 2025-02-16 22:45:26.722962
========= sending heartbeat at 2025-02-16 22:45:36.738337
2025-02-16 22:45:38,774 run_with_executor INFO | Resolved 8117 sources in 57.68 seconds
2025-02-16 22:45:38,791 allocate INFO | Cache allocation start. Active run IDs: P4-J90-1739659731, P5-J2-1739759385, P4-J92-1739781839
2025-02-16 22:45:39,198 refresh INFO | Refreshed cache drive in 0.41 seconds
2025-02-16 22:45:39,718 cleanup_junk_files INFO | Removed 1 invalid item(s) in the cache
2025-02-16 22:45:39,802 refresh INFO | Refreshed cache drive in 0.08 seconds
2025-02-16 22:45:39,813 allocate INFO | Deleted 0 cached files, encountered 0 errors
2025-02-16 22:45:39,813 allocate INFO | Allocated 0 stub cache files; creating links
2025-02-16 22:45:40,490 allocate INFO | Cache allocation complete
2025-02-16 22:45:40,491 run_with_executor INFO | Cache allocation ran in 1.70 seconds
2025-02-16 22:45:40,491 run_with_executor INFO | Found 8117 SSD hit(s)
2025-02-16 22:45:40,491 run_with_executor INFO | Requested files successfully cached to SSD
2025-02-16 22:45:40,900 run_with_executor INFO | SSD cache complete
========= sending heartbeat at 2025-02-16 22:45:46.752398
========= sending heartbeat at 2025-02-16 22:45:56.766626
========= sending heartbeat at 2025-02-16 22:46:06.781204
========= sending heartbeat at 2025-02-16 22:46:16.795327
========= sending heartbeat at 2025-02-16 22:46:26.809683
WARNING: io_uring support disabled (not supported by kernel), I/O performance may degrade
========= sending heartbeat at 2025-02-16 22:46:36.824481
========= sending heartbeat at 2025-02-16 22:46:46.833330
gpufft: creating new cufft plan (plan id 0 pid 112094)
gpu_id 0
ndims 2
dims 640 640 0
inembed 640 640 0
istride 1
idist 409600
onembed 640 640 0
ostride 1
odist 409600
batch 500
type C2C
wkspc automatic
Python traceback:
gpufft: creating new cufft plan (plan id 1 pid 112094)
gpu_id 0
ndims 2
dims 640 640 0
inembed 640 640 0
istride 1
idist 409600
onembed 640 640 0
ostride 1
odist 409600
batch 500
type C2C
wkspc automatic
Python traceback:
HOST ALLOCATION FUNCTION: using numba.cuda.pinned_array
/cryo/jspangl4/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:
kernel(35): warning #68-D: integer conversion resulted in a change of sign
kernel(44): warning #68-D: integer conversion resulted in a change of sign
kernel(17): warning #177-D: variable "N_I" was declared but never referenced
warnings.warn(msg)
========= sending heartbeat at 2025-02-16 22:46:56.838087
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-02-16 22:47:06.852326
========= sending heartbeat at 2025-02-16 22:47:16.866198
========= sending heartbeat at 2025-02-16 22:47:26.873332
========= sending heartbeat at 2025-02-16 22:47:36.887588
========= sending heartbeat at 2025-02-16 22:47:46.902322
========= sending heartbeat at 2025-02-16 22:47:56.916327
========= sending heartbeat at 2025-02-16 22:48:06.924068
/cryo/jspangl4/cryosparc/cryosparc_worker/cryosparc_compute/plotutil.py:571: RuntimeWarning: divide by zero encountered in log
logabs = n.log(n.abs(fM))
========= sending heartbeat at 2025-02-16 22:48:16.938330
========= sending heartbeat at 2025-02-16 22:48:26.951326
gpufft: creating new cufft plan (plan id 2 pid 112094)
gpu_id 0
ndims 3
dims 640 640 640
inembed 640 640 642
istride 1
idist 262963200
onembed 640 640 321
ostride 1
odist 131481600
batch 1
type R2C
wkspc automatic
Python traceback:
========= sending heartbeat at 2025-02-16 22:48:36.965601
========= sending heartbeat at 2025-02-16 22:48:46.979630
gpufft: creating new cufft plan (plan id 3 pid 112094)
gpu_id 0
ndims 2
dims 640 640 0
inembed 640 642 0
istride 1
idist 410880
onembed 640 321 0
ostride 1
odist 205440
batch 500
type R2C
wkspc automatic
Python traceback:
gpufft: creating new cufft plan (plan id 4 pid 112094)
gpu_id 0
ndims 2
dims 640 640 0
inembed 640 642 0
istride 1
idist 410880
onembed 640 321 0
ostride 1
odist 205440
batch 500
type R2C
wkspc automatic
Python traceback:
/cryo/jspangl4/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/dispatcher.py:536: NumbaPerformanceWarning: Grid size 1 will likely result in GPU under-utilization due to low occupancy.
warn(NumbaPerformanceWarning(msg))
========= sending heartbeat at 2025-02-16 22:48:56.986732
========= sending heartbeat at 2025-02-16 22:49:06.993324
========= sending heartbeat at 2025-02-16 22:49:17.007527
========= sending heartbeat at 2025-02-16 22:49:27.021326
========= sending heartbeat at 2025-02-16 22:49:37.033798
========= sending heartbeat at 2025-02-16 22:49:47.047854
========= sending heartbeat at 2025-02-16 22:49:57.062213
========= sending heartbeat at 2025-02-16 22:50:07.076322
========= sending heartbeat at 2025-02-16 22:50:17.090810
========= sending heartbeat at 2025-02-16 22:50:27.104926
========= sending heartbeat at 2025-02-16 22:50:37.119444
gpufft: creating new cufft plan (plan id 5 pid 112094)
gpu_id 0
ndims 2
dims 640 640 0
inembed 640 642 0
istride 1
idist 410880
onembed 640 321 0
ostride 1
odist 205440
batch 218
type R2C
wkspc automatic
Python traceback:
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-02-16 22:50:47.134017
========= sending heartbeat at 2025-02-16 22:50:57.148924
========= sending heartbeat at 2025-02-16 22:51:07.163325
========= sending heartbeat at 2025-02-16 22:51:17.176325
========= sending heartbeat at 2025-02-16 22:51:27.190332
========= sending heartbeat at 2025-02-16 22:51:37.203325
========= sending heartbeat at 2025-02-16 22:51:47.218324
========= sending heartbeat at 2025-02-16 22:51:57.232324
========= sending heartbeat at 2025-02-16 22:52:07.245805
========= sending heartbeat at 2025-02-16 22:52:17.260049
========= sending heartbeat at 2025-02-16 22:52:27.267338
========= sending heartbeat at 2025-02-16 22:52:37.282036
========= sending heartbeat at 2025-02-16 22:52:47.296313
gpufft: creating new cufft plan (plan id 6 pid 112094)
gpu_id 0
ndims 2
dims 640 640 0
inembed 640 642 0
istride 1
idist 410880
onembed 640 321 0
ostride 1
odist 205440
batch 219
type R2C
wkspc automatic
Python traceback:
========= sending heartbeat at 2025-02-16 22:52:57.309803
========= sending heartbeat at 2025-02-16 22:53:07.323322
========= sending heartbeat at 2025-02-16 22:53:17.337921
========= sending heartbeat at 2025-02-16 22:53:27.352129
========= sending heartbeat at 2025-02-16 22:53:37.366449
========= sending heartbeat at 2025-02-16 22:53:47.381188
========= sending heartbeat at 2025-02-16 22:53:57.395827
========= sending heartbeat at 2025-02-16 22:54:07.410808
========= sending heartbeat at 2025-02-16 22:54:17.424840
========= sending heartbeat at 2025-02-16 22:54:27.439176
========= sending heartbeat at 2025-02-16 22:54:37.453322
========= sending heartbeat at 2025-02-16 22:54:47.468237
========= sending heartbeat at 2025-02-16 22:54:57.483323
========= sending heartbeat at 2025-02-16 22:55:07.497566
========= sending heartbeat at 2025-02-16 22:55:17.512436
========= sending heartbeat at 2025-02-16 22:55:27.526681
========= sending heartbeat at 2025-02-16 22:55:37.541327
========= sending heartbeat at 2025-02-16 22:55:47.556377
========= sending heartbeat at 2025-02-16 22:55:57.570968
========= sending heartbeat at 2025-02-16 22:56:07.585579
========= sending heartbeat at 2025-02-16 22:56:17.594460
========= sending heartbeat at 2025-02-16 22:56:27.606836
========= sending heartbeat at 2025-02-16 22:56:37.621841
========= sending heartbeat at 2025-02-16 22:56:47.635323
========= sending heartbeat at 2025-02-16 22:56:57.649322
========= sending heartbeat at 2025-02-16 22:57:07.663324
========= sending heartbeat at 2025-02-16 22:57:17.677698
========= sending heartbeat at 2025-02-16 22:57:27.691535
========= sending heartbeat at 2025-02-16 22:57:37.705546
========= sending heartbeat at 2025-02-16 22:57:47.719551
========= sending heartbeat at 2025-02-16 22:57:57.733558
gpufft: creating new cufft plan (plan id 7 pid 112094)
gpu_id 0
ndims 3
dims 640 640 640
inembed 640 640 642
istride 1
idist 262963200
onembed 640 640 321
ostride 1
odist 131481600
batch 1
type R2C
wkspc manual
Python traceback:
gpufft: creating new cufft plan (plan id 8 pid 112094)
gpu_id 0
ndims 3
dims 320 320 320
inembed 320 320 161
istride 1
idist 16486400
onembed 320 320 322
ostride 1
odist 32972800
batch 1
type C2R
wkspc manual
Python traceback:
========= sending heartbeat at 2025-02-16 22:58:07.747324
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-02-16 22:58:17.760328
<string>:1: RuntimeWarning: invalid value encountered in true_divide
========= sending heartbeat at 2025-02-16 22:58:27.774427
========= sending heartbeat at 2025-02-16 22:58:37.789205
========= sending heartbeat at 2025-02-16 22:58:47.803987
========= sending heartbeat at 2025-02-16 22:58:57.818990
gpufft: creating new cufft plan (plan id 9 pid 112094)
gpu_id 0
ndims 3
dims 640 640 640
inembed 640 640 321
istride 1
idist 131481600
onembed 640 640 642
ostride 1
odist 262963200
batch 1
type C2R
wkspc manual
Python traceback:
========= sending heartbeat at 2025-02-16 22:59:07.833975
/cryo/jspangl4/cryosparc/cryosparc_worker/bin/cryosparcw: line 153: 112094 Killed python -c "import cryosparc_compute.run as run; run.run()" "$@"