Patch motion correction (multi) failed - AssertionError: Child process with PID 3990897 has terminated unexpectedly!

Hi all, I am unable to get past the Patch Motion Correction step of the T20S tutorial project with cryoSPARC v3.3.2. The job fails with the following error output:

License is valid.

Launching job on lane narval target narval ...

Launching job on cluster narval


====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
#SBATCH --job-name cryosparc_P2_J6
#SBATCH --account=def-xinli808
#SBATCH --output=/home/zming01/projects/def-xinli808/zming01/T20S/P2/J6/output.txt
#SBATCH --error=/home/zming01/projects/def-xinli808/zming01/T20S/P2/J6/error.txt
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=6
#SBATCH --cpus-per-task=1
#SBATCH --threads-per-core=1
#SBATCH --gres=gpu:1

module load cuda/11.0
mkdir -p ${SLURM_TMPDIR}/cryosparc_cache
/project/def-xinli808/zming01/cryosparc/cryosparc_worker/bin/cryosparcw run --project P2 --job J6 --master_hostname nc20128.narval.calcul.quebec --master_command_core_port 39002 > /home/zming01/projects/def-xinli808/zming01/T20S/P2/J6/job.log 2>&1 
==========================================================================
==========================================================================

-------- Submission command: 
sbatch /home/zming01/projects/def-xinli808/zming01/T20S/P2/J6/queue_sub_script.sh

-------- Cluster Job ID: 
8821379

-------- Queued on cluster at 2022-08-22 16:42:40.838810

-------- Job status at 2022-08-22 16:42:40.875890
          JOBID     USER      ACCOUNT           NAME  ST  TIME_LEFT NODES CPUS TRES_PER_N MIN_MEM NODELIST (REASON) 
        8821379  zming01 def-xinli808 cryosparc_P2_J  PD    1:00:00     1    6 gres:gpu:1    256M  (None) 

[CPU: 70.1 MB]   Project P2 Job J6 Started

[CPU: 70.1 MB]   Master running v3.3.2, worker running v3.3.2

[CPU: 70.2 MB]   Working in directory: /home/zming01/projects/def-xinli808/zming01/T20S/P2/J6

[CPU: 70.2 MB]   Running on lane narval

[CPU: 70.2 MB]   Resources allocated: 

[CPU: 70.2 MB]     Worker:  narval

[CPU: 70.2 MB]     CPU   :  [0, 1, 2, 3, 4, 5]

[CPU: 70.2 MB]     GPU   :  [0]

[CPU: 70.2 MB]     RAM   :  [0, 1]

[CPU: 70.4 MB]     SSD   :  False

[CPU: 70.4 MB]   --------------------------------------------------------------

[CPU: 70.4 MB]   Importing job module for job type patch_motion_correction_multi...

[CPU: 216.0 MB]  Job ready to run

[CPU: 216.0 MB]  ***************************************************************

[CPU: 216.3 MB]  Job will process this many movies:  20

[CPU: 216.3 MB]  parent process is 3989340

[CPU: 170.6 MB]  Calling CUDA init from 3990897

[CPU: 300.2 MB]  -- 0.0: processing 0 of 20: J1/imported/012587441680555554194_14sep05c_00024sq_00003hl_00002es.frames.tif
        loading /home/zming01/projects/def-xinli808/zming01/T20S/P2/J1/imported/012587441680555554194_14sep05c_00024sq_00003hl_00002es.frames.tif
        Loading raw movie data from J1/imported/012587441680555554194_14sep05c_00024sq_00003hl_00002es.frames.tif ...

[CPU: 171.4 MB]  Outputting partial results now...

[CPU: 172.1 MB]  Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 85, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 402, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi
AssertionError: Child process with PID 3990897 has terminated unexpectedly!

Any suggestions would be appreciated. Thank you!

Zhenhua

Hi Zhenhua,

Could you please post the contents of
/home/zming01/projects/def-xinli808/zming01/T20S/P2/J6/job.log?

Hi wtempel,

Here are the contents:

[zming01@narval1 J6]$ cat job.log


================= CRYOSPARCW =======  2022-08-22 16:55:04.086154  =========
Project P2 Job J6
Master nc20128.narval.calcul.quebec Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 3989340
MAIN PID 3989340
motioncorrection.run_patch cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
***************************************************************
Running job on hostname %s narval
Allocated Resources :  {'fixed': {'SSD': False}, 'hostname': 'narval', 'lane': 'narval', 'lane_type': 'narval', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3, 4, 5], 'GPU': [0], 'RAM': [0, 1]}, 'target': {'cache_path': '/localscratch/zming01.*/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'hostname': 'narval', 'lane': 'narval', 'name': 'narval', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --account=def-xinli808\n#SBATCH --output={{ job_dir_abs }}/output.txt\n#SBATCH --error={{ job_dir_abs }}/error.txt\n{%- if num_gpu == 0 %}\n#SBATCH --ntasks={{ num_cpu }}\n#SBATCH --cpus-per-task=1\n#SBATCH --threads-per-core=1\n{%- else %}\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node={{ num_cpu }}\n#SBATCH --cpus-per-task=1\n#SBATCH --threads-per-core=1\n#SBATCH --gres=gpu:{{ num_gpu }}\n{%- endif %}\n\nmodule load cuda/11.0\nmkdir -p ${SLURM_TMPDIR}/cryosparc_cache\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'narval', 'type': 'cluster', 'worker_bin_path': '/project/def-xinli808/zming01/cryosparc/cryosparc_worker/bin/cryosparcw'}}
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
**** handle exception rc
set status to failed
========= main process now complete.
========= monitor process now complete.

Thanks
Zhenhua

Do output.txt or error.txt inside
/home/zming01/projects/def-xinli808/zming01/T20S/P2/J6/
hold useful information?

The output.txt file is empty. The error.txt file contains the following:

[zming01@narval4 J6]$ cat error.txt

Due to MODULEPATH changes, the following have been reloaded:
  1) libfabric/1.10.1     2) openmpi/4.0.3     3) ucx/1.8.0

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=8821379.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
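
This slurmstepd message shows what happened: the job exceeded the memory limit of its cgroup and was killed by the out-of-memory handler, which is why the parent cryoSPARC process only saw its child terminate unexpectedly. Notably, the submission script contains no --mem request, and the squeue output above shows MIN_MEM 256M. If job accounting is enabled on the cluster (an assumption on my part), you can compare the job's actual peak memory use against its request, for example:

sacct -j 8821379 --format=JobID,MaxRSS,ReqMem,State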

You may need to include a line
#SBATCH --mem={{ ram_gb | int }}G
in your cluster_script.sh and re-run cryosparcm cluster connect.
If you still get out-of-memory errors, you may experiment with multipliers like:
#SBATCH --mem={{ (ram_gb * 2) | int }}G
(note the parentheses: in Jinja2 the | int filter binds only to the 2 without them, and the result may render as a float that SLURM rejects)
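For reference, here is a minimal sketch of the full template with the new line added. Everything except the --mem line is copied verbatim from the script_tpl recorded in your job.log above; treat it as a starting point, not a definitive template:

#!/usr/bin/env bash
#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --account=def-xinli808
#SBATCH --output={{ job_dir_abs }}/output.txt
#SBATCH --error={{ job_dir_abs }}/error.txt
{%- if num_gpu == 0 %}
#SBATCH --ntasks={{ num_cpu }}
#SBATCH --cpus-per-task=1
#SBATCH --threads-per-core=1
{%- else %}
#SBATCH --nodes=1
#SBATCH --ntasks-per-node={{ num_cpu }}
#SBATCH --cpus-per-task=1
#SBATCH --threads-per-core=1
#SBATCH --gres=gpu:{{ num_gpu }}
{%- endif %}
#SBATCH --mem={{ ram_gb | int }}G

module load cuda/11.0
mkdir -p ${SLURM_TMPDIR}/cryosparc_cache
{{ run_cmd }}

After editing, re-register the lane from the directory that holds cluster_info.json and cluster_script.sh:

cryosparcm cluster connect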
Because a larger memory request may spend more time in the queue, you may want to define two or more cryoSPARC cluster "lanes" with different --mem= settings in cluster_script.sh and distinct "name": and "title": values in cluster_info.json (see the guide); a sketch follows below. That way, jobs with smaller memory requirements spend less time in the queue.
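As an illustration only (the lane name and title below are hypothetical; every other value is copied from the Allocated Resources dump in your job.log), a second lane's cluster_info.json might look like:

{
    "name": "narval_bigmem",
    "title": "narval_bigmem",
    "worker_bin_path": "/project/def-xinli808/zming01/cryosparc/cryosparc_worker/bin/cryosparcw",
    "cache_path": "/localscratch/zming01.*/cryosparc_cache",
    "send_cmd_tpl": "{{ command }}",
    "qsub_cmd_tpl": "sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl": "squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl": "scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl": "sinfo"
}

with a matching cluster_script.sh whose --mem line uses the larger multiplier. Running cryosparcm cluster connect in that directory registers it as an additional lane without touching the original.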

Hi wtempel,
That worked for me, and the error is now fixed.
Thanks for the help! I appreciate the cryoSPARC community and look forward to learning more here.
Zhenhua