Patch motion correction (multi) failed - AssertionError: Child process with PID 3990897 has terminated unexpectedly!

Hi all, I am unable to get past the Patch Motion Correction step of the T20S tutorial project with cryoSPARC v3.3.2. The job fails with the following error output:

License is valid.

Launching job on lane narval target narval ...

Launching job on cluster narval


====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
#SBATCH --job-name cryosparc_P2_J6
#SBATCH --account=def-xinli808
#SBATCH --output=/home/zming01/projects/def-xinli808/zming01/T20S/P2/J6/output.txt
#SBATCH --error=/home/zming01/projects/def-xinli808/zming01/T20S/P2/J6/error.txt
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=6
#SBATCH --cpus-per-task=1
#SBATCH --threads-per-core=1
#SBATCH --gres=gpu:1

module load cuda/11.0
mkdir -p ${SLURM_TMPDIR}/cryosparc_cache
/project/def-xinli808/zming01/cryosparc/cryosparc_worker/bin/cryosparcw run --project P2 --job J6 --master_hostname nc20128.narval.calcul.quebec --master_command_core_port 39002 > /home/zming01/projects/def-xinli808/zming01/T20S/P2/J6/job.log 2>&1 
==========================================================================
==========================================================================

-------- Submission command: 
sbatch /home/zming01/projects/def-xinli808/zming01/T20S/P2/J6/queue_sub_script.sh

-------- Cluster Job ID: 
8821379

-------- Queued on cluster at 2022-08-22 16:42:40.838810

-------- Job status at 2022-08-22 16:42:40.875890
          JOBID     USER      ACCOUNT           NAME  ST  TIME_LEFT NODES CPUS TRES_PER_N MIN_MEM NODELIST (REASON) 
        8821379  zming01 def-xinli808 cryosparc_P2_J  PD    1:00:00     1    6 gres:gpu:1    256M  (None) 

[CPU: 70.1 MB]   Project P2 Job J6 Started

[CPU: 70.1 MB]   Master running v3.3.2, worker running v3.3.2

[CPU: 70.2 MB]   Working in directory: /home/zming01/projects/def-xinli808/zming01/T20S/P2/J6

[CPU: 70.2 MB]   Running on lane narval

[CPU: 70.2 MB]   Resources allocated: 

[CPU: 70.2 MB]     Worker:  narval

[CPU: 70.2 MB]     CPU   :  [0, 1, 2, 3, 4, 5]

[CPU: 70.2 MB]     GPU   :  [0]

[CPU: 70.2 MB]     RAM   :  [0, 1]

[CPU: 70.4 MB]     SSD   :  False

[CPU: 70.4 MB]   --------------------------------------------------------------

[CPU: 70.4 MB]   Importing job module for job type patch_motion_correction_multi...

[CPU: 216.0 MB]  Job ready to run

[CPU: 216.0 MB]  ***************************************************************

[CPU: 216.3 MB]  Job will process this many movies:  20

[CPU: 216.3 MB]  parent process is 3989340

[CPU: 170.6 MB]  Calling CUDA init from 3990897

[CPU: 300.2 MB]  -- 0.0: processing 0 of 20: J1/imported/012587441680555554194_14sep05c_00024sq_00003hl_00002es.frames.tif
        loading /home/zming01/projects/def-xinli808/zming01/T20S/P2/J1/imported/012587441680555554194_14sep05c_00024sq_00003hl_00002es.frames.tif
        Loading raw movie data from J1/imported/012587441680555554194_14sep05c_00024sq_00003hl_00002es.frames.tif ...

[CPU: 171.4 MB]  Outputting partial results now...

[CPU: 172.1 MB]  Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 85, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 402, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi
AssertionError: Child process with PID 3990897 has terminated unexpectedly!

Any suggestions would be appreciated. Thank you!

Zhenhua

Hi Zhenhua,

Could you please post the contents of
/home/zming01/projects/def-xinli808/zming01/T20S/P2/J6/job.log?

Hi wtempel,

Here are the contents:

[zming01@narval1 J6]$ cat job.log


================= CRYOSPARCW =======  2022-08-22 16:55:04.086154  =========
Project P2 Job J6
Master nc20128.narval.calcul.quebec Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 3989340
MAIN PID 3989340
motioncorrection.run_patch cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
***************************************************************
Running job on hostname %s narval
Allocated Resources :  {'fixed': {'SSD': False}, 'hostname': 'narval', 'lane': 'narval', 'lane_type': 'narval', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3, 4, 5], 'GPU': [0], 'RAM': [0, 1]}, 'target': {'cache_path': '/localscratch/zming01.*/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'hostname': 'narval', 'lane': 'narval', 'name': 'narval', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --account=def-xinli808\n#SBATCH --output={{ job_dir_abs }}/output.txt\n#SBATCH --error={{ job_dir_abs }}/error.txt\n{%- if num_gpu == 0 %}\n#SBATCH --ntasks={{ num_cpu }}\n#SBATCH --cpus-per-task=1\n#SBATCH --threads-per-core=1\n{%- else %}\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node={{ num_cpu }}\n#SBATCH --cpus-per-task=1\n#SBATCH --threads-per-core=1\n#SBATCH --gres=gpu:{{ num_gpu }}\n{%- endif %}\n\nmodule load cuda/11.0\nmkdir -p ${SLURM_TMPDIR}/cryosparc_cache\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'narval', 'type': 'cluster', 'worker_bin_path': '/project/def-xinli808/zming01/cryosparc/cryosparc_worker/bin/cryosparcw'}}
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
**** handle exception rc
set status to failed
========= main process now complete.
========= monitor process now complete.

Thanks
Zhenhua

Do output.txt or error.txt inside
/home/zming01/projects/def-xinli808/zming01/T20S/P2/J6/
hold useful information?

The output.txt file is empty. The error.txt file contains the following:

[zming01@narval4 J6]$ cat error.txt

Due to MODULEPATH changes, the following have been reloaded:
  1) libfabric/1.10.1     2) openmpi/4.0.3     3) ucx/1.8.0

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=8821379.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
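
This slurmstepd message shows what happened: the job exceeded the memory limit of its cgroup and was killed by the out-of-memory handler, which is why the parent cryoSPARC process only saw its child terminate unexpectedly. Notably, the submission script contains no --mem request, and the squeue output above shows MIN_MEM 256M. If job accounting is enabled on the cluster (an assumption on my part), you can compare the job's actual peak memory use against its request, for example:

sacct -j 8821379 --format=JobID,MaxRSS,ReqMem,State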

You may need to include a line
#SBATCH --mem={{ ram_gb | int }}G
in your cluster_script.sh and re-run cryosparcm cluster connect.
If you still get out-of-memory errors, you may experiment with multipliers like:
#SBATCH --mem={{ (ram_gb * 2) | int }}G
(note the parentheses: in Jinja2 the | int filter binds only to the 2 without them, and the result may render as a float that SLURM rejects)
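For reference, here is a minimal sketch of the full template with the new line added. Everything except the --mem line is copied verbatim from the script_tpl recorded in your job.log above; treat it as a starting point, not a definitive template:

#!/usr/bin/env bash
#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --account=def-xinli808
#SBATCH --output={{ job_dir_abs }}/output.txt
#SBATCH --error={{ job_dir_abs }}/error.txt
{%- if num_gpu == 0 %}
#SBATCH --ntasks={{ num_cpu }}
#SBATCH --cpus-per-task=1
#SBATCH --threads-per-core=1
{%- else %}
#SBATCH --nodes=1
#SBATCH --ntasks-per-node={{ num_cpu }}
#SBATCH --cpus-per-task=1
#SBATCH --threads-per-core=1
#SBATCH --gres=gpu:{{ num_gpu }}
{%- endif %}
#SBATCH --mem={{ ram_gb | int }}G

module load cuda/11.0
mkdir -p ${SLURM_TMPDIR}/cryosparc_cache
{{ run_cmd }}

After editing, re-register the lane from the directory that holds cluster_info.json and cluster_script.sh:

cryosparcm cluster connect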
Because a larger memory request may spend more time in the queue, you may want to define two or more cryoSPARC cluster "lanes" with different --mem= settings in cluster_script.sh and distinct "name": and "title": values in cluster_info.json (see the guide); a sketch follows below. That way, jobs with smaller memory requirements spend less time in the queue.
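As an illustration only (the lane name and title below are hypothetical; every other value is copied from the Allocated Resources dump in your job.log), a second lane's cluster_info.json might look like:

{
    "name": "narval_bigmem",
    "title": "narval_bigmem",
    "worker_bin_path": "/project/def-xinli808/zming01/cryosparc/cryosparc_worker/bin/cryosparcw",
    "cache_path": "/localscratch/zming01.*/cryosparc_cache",
    "send_cmd_tpl": "{{ command }}",
    "qsub_cmd_tpl": "sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl": "squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl": "scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl": "sinfo"
}

with a matching cluster_script.sh whose --mem line uses the larger multiplier. Running cryosparcm cluster connect in that directory registers it as an additional lane without touching the original.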

Hi wtempel,
That worked for me, and the error is now fixed.
Thanks for the help! I appreciate the cryoSPARC community and look forward to learning more here.
Zhenhua