Patch motion correction (multi) failed - AssertionError: Child process with PID 3990897 has terminated unexpectedly!

Hi all, the patch motion correction step of the T20S tutorial project failed for me with cryoSPARC v3.3.2. The error output is as follows:

License is valid.

Launching job on lane narval target narval ...

Launching job on cluster narval

====================== Cluster submission script: ========================
#!/usr/bin/env bash
#SBATCH --job-name cryosparc_P2_J6
#SBATCH --account=def-xinli808
#SBATCH --output=/home/zming01/projects/def-xinli808/zming01/T20S/P2/J6/output.txt
#SBATCH --error=/home/zming01/projects/def-xinli808/zming01/T20S/P2/J6/error.txt
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=6
#SBATCH --cpus-per-task=1
#SBATCH --threads-per-core=1
#SBATCH --gres=gpu:1

module load cuda/11.0
mkdir -p ${SLURM_TMPDIR}/cryosparc_cache
/project/def-xinli808/zming01/cryosparc/cryosparc_worker/bin/cryosparcw run --project P2 --job J6 --master_hostname --master_command_core_port 39002 > /home/zming01/projects/def-xinli808/zming01/T20S/P2/J6/job.log 2>&1 

-------- Submission command: 
sbatch /home/zming01/projects/def-xinli808/zming01/T20S/P2/J6/

-------- Cluster Job ID: 

-------- Queued on cluster at 2022-08-22 16:42:40.838810

-------- Job status at 2022-08-22 16:42:40.875890
        8821379  zming01 def-xinli808 cryosparc_P2_J  PD    1:00:00     1    6 gres:gpu:1    256M  (None) 

[CPU: 70.1 MB]   Project P2 Job J6 Started

[CPU: 70.1 MB]   Master running v3.3.2, worker running v3.3.2

[CPU: 70.2 MB]   Working in directory: /home/zming01/projects/def-xinli808/zming01/T20S/P2/J6

[CPU: 70.2 MB]   Running on lane narval

[CPU: 70.2 MB]   Resources allocated: 

[CPU: 70.2 MB]     Worker:  narval

[CPU: 70.2 MB]     CPU   :  [0, 1, 2, 3, 4, 5]

[CPU: 70.2 MB]     GPU   :  [0]

[CPU: 70.2 MB]     RAM   :  [0, 1]

[CPU: 70.4 MB]     SSD   :  False

[CPU: 70.4 MB]   --------------------------------------------------------------

[CPU: 70.4 MB]   Importing job module for job type patch_motion_correction_multi...

[CPU: 216.0 MB]  Job ready to run

[CPU: 216.0 MB]  ***************************************************************

[CPU: 216.3 MB]  Job will process this many movies:  20

[CPU: 216.3 MB]  parent process is 3989340

[CPU: 170.6 MB]  Calling CUDA init from 3990897

[CPU: 300.2 MB]  -- 0.0: processing 0 of 20: J1/imported/012587441680555554194_14sep05c_00024sq_00003hl_00002es.frames.tif
        loading /home/zming01/projects/def-xinli808/zming01/T20S/P2/J1/imported/012587441680555554194_14sep05c_00024sq_00003hl_00002es.frames.tif
        Loading raw movie data from J1/imported/012587441680555554194_14sep05c_00024sq_00003hl_00002es.frames.tif ...

[CPU: 171.4 MB]  Outputting partial results now...

[CPU: 172.1 MB]  Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/", line 85, in
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/", line 402, in
AssertionError: Child process with PID 3990897 has terminated unexpectedly!

Any suggestions would be helpful, thank you!


Hi Zhenhua,

Could you please post the contents of job.log?

Hi wtempel,

Here are the contents,

[zming01@narval1 J6]$ cat job.log

================= CRYOSPARCW =======  2022-08-22 16:55:04.086154  =========
Project P2 Job J6
Master Port 39002
========= monitor process now starting main process
MAIN PID 3989340
========= monitor process now waiting for main process
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
Running job on hostname %s narval
Allocated Resources :  {'fixed': {'SSD': False}, 'hostname': 'narval', 'lane': 'narval', 'lane_type': 'narval', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3, 4, 5], 'GPU': [0], 'RAM': [0, 1]}, 'target': {'cache_path': '/localscratch/zming01.*/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'hostname': 'narval', 'lane': 'narval', 'name': 'narval', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --account=def-xinli808\n#SBATCH --output={{ job_dir_abs }}/output.txt\n#SBATCH --error={{ job_dir_abs }}/error.txt\n{%- if num_gpu == 0 %}\n#SBATCH --ntasks={{ num_cpu }}\n#SBATCH --cpus-per-task=1\n#SBATCH --threads-per-core=1\n{%- else %}\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node={{ num_cpu }}\n#SBATCH --cpus-per-task=1\n#SBATCH --threads-per-core=1\n#SBATCH --gres=gpu:{{ num_gpu }}\n{%- endif %}\n\nmodule load cuda/11.0\nmkdir -p ${SLURM_TMPDIR}/cryosparc_cache\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'narval', 'type': 'cluster', 'worker_bin_path': '/project/def-xinli808/zming01/cryosparc/cryosparc_worker/bin/cryosparcw'}}
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
**** handle exception rc
set status to failed
========= main process now complete.
========= monitor process now complete.


Do output.txt or error.txt inside the job directory hold useful information?

The output.txt file is empty. The error.txt file contains the following:

[zming01@narval4 J6]$ cat error.txt

Due to MODULEPATH changes, the following have been reloaded:
  1) libfabric/1.10.1     2) openmpi/4.0.3     3) ucx/1.8.0

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=8821379.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

You may need to include a line
#SBATCH --mem={{ ram_gb | int }}G
in your cluster script template and re-run cryosparcm cluster connect.
If you still get out-of-memory errors, you may experiment with multipliers, for example:
#SBATCH --mem={{ (ram_gb * 2) | int }}G
(Note the parentheses: in Jinja, the int filter binds more tightly than arithmetic, so the multiplication should be grouped to ensure an integer value.) Because a larger memory request may spend more time in the queue, you may want to define two or more cryoSPARC cluster “lanes” with varying --mem= settings in their cluster script templates and distinct "name": and "title": values in cluster_info.json (see guide). This way, jobs with smaller memory requirements may spend less time in the queue.
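For reference, a sketch of what the GPU branch of the cluster script template might look like with the memory request added. The surrounding directives are taken from the script template shown in the job log above; the exact template file name and the choice of multiplier are assumptions, not a definitive configuration:

```shell
#!/usr/bin/env bash
#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --account=def-xinli808
#SBATCH --output={{ job_dir_abs }}/output.txt
#SBATCH --error={{ job_dir_abs }}/error.txt
#SBATCH --nodes=1
#SBATCH --ntasks-per-node={{ num_cpu }}
#SBATCH --cpus-per-task=1
#SBATCH --threads-per-core=1
#SBATCH --gres=gpu:{{ num_gpu }}
# Request system RAM explicitly so the cgroup OOM killer does not
# terminate the worker; ram_gb is filled in by cryoSPARC at submission.
# The 2x multiplier is an example; parentheses keep the int filter
# applying to the whole product.
#SBATCH --mem={{ (ram_gb * 2) | int }}G

module load cuda/11.0
mkdir -p ${SLURM_TMPDIR}/cryosparc_cache
{{ run_cmd }}
```

After editing the template, re-run cryosparcm cluster connect from the directory containing cluster_info.json and the script template so the updated configuration is registered with the master.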

Hi wtempel,
It did work for me and the error has been fixed.
Thanks for the help! I appreciate the cryoSPARC community and look forward to learning more here.