Inspect pick job error

Hi,

I got an error when running the Inspect Picks job, which I have never encountered before.

The log file is below:

MAINPROCESS PID 518870
MAIN PID 518870
interactive.run_inspect_picks_v2 cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process


INTERACTIVE JOB STARTED === 2024-01-08 13:56:07.542323 ==========================
========= sending heartbeat at 2024-01-08 13:56:15.570767
========= sending heartbeat at 2024-01-08 13:56:25.647065
========= sending heartbeat at 2024-01-08 13:56:35.657188
========= sending heartbeat at 2024-01-08 13:56:45.734595
========= sending heartbeat at 2024-01-08 13:56:55.836303
========= sending heartbeat at 2024-01-08 13:57:06.875793
========= sending heartbeat at 2024-01-08 13:57:29.928759
========= sending heartbeat at 2024-01-08 13:57:44.705469
========= sending heartbeat at 2024-01-08 13:57:58.021346
========= sending heartbeat at 2024-01-08 13:58:10.809148
========= sending heartbeat at 2024-01-08 13:58:24.619133
========= sending heartbeat at 2024-01-08 13:58:38.654659
========= sending heartbeat at 2024-01-08 13:58:56.405787
========= sending heartbeat at 2024-01-08 13:59:14.665339
========= sending heartbeat at 2024-01-08 13:59:27.374014
========= sending heartbeat at 2024-01-08 13:59:47.619253
========= sending heartbeat at 2024-01-08 14:00:19.188560
========= sending heartbeat at 2024-01-08 14:00:37.380541
========= sending heartbeat at 2024-01-08 14:00:50.451157
========= sending heartbeat at 2024-01-08 14:01:01.844165
========= sending heartbeat at 2024-01-08 14:01:15.374036
========= sending heartbeat at 2024-01-08 14:01:28.172043
========= sending heartbeat at 2024-01-08 14:01:41.639090
========= sending heartbeat at 2024-01-08 14:01:56.021621
========= sending heartbeat at 2024-01-08 14:02:08.901633
========= sending heartbeat at 2024-01-08 14:02:24.847553
========= sending heartbeat at 2024-01-08 14:02:37.613791
========= sending heartbeat at 2024-01-08 14:02:52.890997
========= sending heartbeat at 2024-01-08 14:03:15.024436
dataset.more_memory: out of memory (errno 12: Cannot allocate memory)
**** handle exception rc
========= sending heartbeat at 2024-01-08 14:03:27.857363
========= sending heartbeat at 2024-01-08 14:03:43.460062
/net/flash/flash/qchen/cryosparc/cryosparc_master/cryosparc_compute/jobs/motioncorrection/mic_utils.py:95: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See Deprecation Notices (Numba 0+untagged.4124.gd4460fe.dirty documentation) for details.
  @jit(nogil=True)
/net/flash/flash/qchen/cryosparc/cryosparc_master/cryosparc_compute/micrographs.py:563: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See Deprecation Notices (Numba 0+untagged.4124.gd4460fe.dirty documentation) for details.
  def contrast_normalization(arr_bin, tile_size = 128):
Traceback (most recent call last):
  File "/net/flash/flash/qchen/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/dataset.py", line 554, in load
    dset = cls(indata)
  File "/net/flash/flash/qchen/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/dataset.py", line 750, in __init__
    self[field[0]] = data
  File "/net/flash/flash/qchen/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/dataset.py", line 818, in __setitem__
    self[key][:] = val
ValueError: could not broadcast input array from shape (34027060,) into shape (0,)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 95, in cryosparc_master.cryosparc_compute.run.main
  File "/net/flash/flash/qchen/cryosparc/cryosparc_master/cryosparc_compute/jobs/interactive/run_inspect_picks_v2.py", line 79, in run
    particles_dset = rc.load_input_group('particles')
  File "/net/flash/flash/qchen/cryosparc/cryosparc_master/cryosparc_compute/jobs/runcommon.py", line 678, in load_input_group
    dsets = [load_input_connection_slots(input_group_name, keep_slot_names, idx, allow_passthrough=allow_passthrough, memoize=memoize) for idx in range(num_connections)]
  File "/net/flash/flash/qchen/cryosparc/cryosparc_master/cryosparc_compute/jobs/runcommon.py", line 678, in <listcomp>
    dsets = [load_input_connection_slots(input_group_name, keep_slot_names, idx, allow_passthrough=allow_passthrough, memoize=memoize) for idx in range(num_connections)]
  File "/net/flash/flash/qchen/cryosparc/cryosparc_master/cryosparc_compute/jobs/runcommon.py", line 642, in load_input_connection_slots
    dsets = [load_input_connection_single_slot(input_group_name, slot_name, connection_idx, allow_passthrough=allow_passthrough, memoize=memoize) for slot_name in slot_names]
  File "/net/flash/flash/qchen/cryosparc/cryosparc_master/cryosparc_compute/jobs/runcommon.py", line 642, in <listcomp>
    dsets = [load_input_connection_single_slot(input_group_name, slot_name, connection_idx, allow_passthrough=allow_passthrough, memoize=memoize) for slot_name in slot_names]
  File "/net/flash/flash/qchen/cryosparc/cryosparc_master/cryosparc_compute/jobs/runcommon.py", line 634, in load_input_connection_single_slot
    d = load_output_result_dset(_project_uid, output_result, slotconnection['version'], slot_name, memoize=memoize)
  File "/net/flash/flash/qchen/cryosparc/cryosparc_master/cryosparc_compute/jobs/runcommon.py", line 589, in load_output_result_dset
    d = dataset.Dataset.load(abspath)
  File "/net/flash/flash/qchen/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/dataset.py", line 592, in load
    raise DatasetLoadError(f"Could not load dataset from file {file}") from err
cryosparc_tools.cryosparc.errors.DatasetLoadError: Could not load dataset from file /cephfs2/qchen/Cryosparc/CS/J4/picked_particles.cs
set status to failed
========= main process now complete at 2024-01-08 14:03:50.998265.
========= monitor process now complete at 2024-01-08 14:03:51.011561.
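
For context on the first traceback: the ValueError there is NumPy's generic error for copying into a zero-length array. My guess (an assumption on my part, not something the log confirms) is that after dataset.more_memory failed with errno 12, the destination column was left with zero elements, so copying the 34,027,060-row input into it failed. A minimal, self-contained reproduction of that error type:

```python
import numpy as np

# Reproduce the ValueError from the traceback in miniature: copying a
# non-empty array into a zero-length destination. (The real job tried to
# copy 34,027,060 rows; 5 is used here just to keep the example small.)
dst = np.zeros(0, dtype=np.float32)  # zero-length target, as if allocation failed
src = np.ones(5, dtype=np.float32)
try:
    dst[:] = src
except ValueError as e:
    print(e)  # could not broadcast input array from shape (5,) into shape (0,)
```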

Any ideas would be appreciated.

How much RAM does the CryoSPARC master computer have, and what tasks, other than CryoSPARC master processes, does the computer handle?
Could you please post the output of the command
free -g?

This is the Cluster submission script:

#!/bin/sh
#SBATCH --export=ALL
#SBATCH --output={{ job_dir_abs }}/{{ job_uid }}.out
#SBATCH --error={{ job_dir_abs }}/{{ job_uid }}.err
#SBATCH --nodes=1
#SBATCH --ntasks=1 {{ extra_params }}
#SBATCH --job-name=CS_{{ project_uid }}-{{ job_uid }}
{%- if num_gpu == 0 %}
#SBATCH --partition=cpu
#SBATCH --cpus-per-task=16
#SBATCH --mem=50G
{%- else %}
#SBATCH --partition=gpu
#SBATCH --cpus-per-task={{ num_gpu*8 }}
#SBATCH --gres=gpu:{{ num_gpu }}
{%- if constraint == "g" %}
#SBATCH --constraint=GTX
{%- elif constraint == "r" %}
#SBATCH --constraint=RTX
{%- endif %}
{%- if custom_mem %}
#SBATCH --mem={{ custom_mem }}G
{%- else %}
#SBATCH --mem=45G
{%- endif %}
{%- if grp_acct %}
#SBATCH --account=tategrp
#SBATCH --qos=24_gpu_qos
{%- endif %}
{%- endif %}
#SBATCH --open-mode=append
#SBATCH --time=7-00:00:00
#SBATCH --mail-type=FAIL

export CRYOSPARC_SSD_PATH="/ssd/${SLURM_JOB_USER}-${SLURM_JOBID}"

{{ run_cmd }}
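
As an aside for anyone adapting this script: CryoSPARC fills the {{ ... }} and {% ... %} placeholders with Jinja2, so a branch can be previewed outside of CryoSPARC. A small sketch (the trimmed-down template and the sample value num_gpu=0 are illustrative only, not the full script above):

```python
from jinja2 import Template

# Trimmed-down copy of the cluster template's CPU/GPU branch, rendered
# with a sample value to preview what would be submitted.
template = Template("""\
#!/bin/sh
#SBATCH --nodes=1
{%- if num_gpu == 0 %}
#SBATCH --partition=cpu
#SBATCH --mem=50G
{%- else %}
#SBATCH --partition=gpu
#SBATCH --gres=gpu:{{ num_gpu }}
{%- endif %}
""")

print(template.render(num_gpu=0))
```

With num_gpu=0 only the CPU branch survives; rendering with num_gpu=2 instead keeps the GPU branch and expands the gres line to gpu:2.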

This is the output of the command free -g:

              total        used        free      shared  buff/cache   available
Mem:            250          13         234           0           2         235
Swap:             3           0           3

I have just restarted CryoSPARC and it is now running again, with more free memory as shown in the output of free -g: it was 3 (free) before the restart and is now 234 (free). So I guess the failure was caused by a shortage of free memory on the cluster. Many thanks for your help.
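
As a follow-up, here is a small sketch I might use to check available RAM on the master node before queuing another interactive job. The helper name available_gib and the 10 GiB threshold are my own arbitrary choices, and it is Linux-only since it reads /proc/meminfo:

```python
def available_gib(meminfo_path="/proc/meminfo"):
    """Return the MemAvailable value from /proc/meminfo in GiB, or None if absent."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / (1024 * 1024)  # value is in kB
    return None

if __name__ == "__main__":
    avail = available_gib()
    if avail is None:
        print("MemAvailable not found in /proc/meminfo")
    elif avail < 10:  # 10 GiB: arbitrary example threshold
        print(f"warning: only {avail:.1f} GiB available")
    else:
        print(f"{avail:.1f} GiB available")
```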
