T20S Extensive Workflow: errors during Patch Motion Correction, failure on Patch CTF Estimation

Hi CryoSPARC,

I have installed CryoSPARC v3.3.1 on a small cluster with NVIDIA A100 cards, using Slurm to manage the workers.

When running the standard “T20S Extensive Workflow”, the Patch Motion Correction step finishes, but its log contains errors.

e.g.

[CPU: 244.7 MB]  Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 85, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 52, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi
  File "/mnt/userdata/jvanschy/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/os.py", line 223, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/mnt/userdata/jvanschy/P1/J6/motioncorrected'

and

[CPU: 272.5 MB]  Error occurred while processing J5/imported/000166890288247831958_14sep05c_00024sq_00003hl_00005es.frames.tif
Traceback (most recent call last):
  File "/mnt/userdata/jvanschy/cryosparc/cryosparc_worker/cryosparc_compute/jobs/pipeline.py", line 59, in exec
    return self.process(item)
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 337, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
AssertionError: Job is not in running state - worker thread with PID 168753 terminating self.

Marking J5/imported/000166890288247831958_14sep05c_00024sq_00003hl_00005es.frames.tif as incomplete and continuing...

Should I be concerned about these, even though the job is marked as ‘Completed’?

Next, the “Patch CTF Estimation” job starts; this one, however, definitely fails.


Launching job on lane debu target debu ...

Launching job on cluster debu


====================== Cluster submission script: ========================
==========================================================================
#!/bin/bash
#SBATCH --job-name=cryosparc_P1_J8
#SBATCH --partition=batch
#SBATCH --output=/mnt/userdata/jvanschy/P1/J8/job.log
#SBATCH --error=/mnt/userdata/jvanschy/P1/J8/job.log
#SBATCH --nodes=1
#SBATCH --mem=8000M
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --threads-per-core=1
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding

srun /mnt/userdata/jvanschy/cryosparc/cryosparc_worker/bin/cryosparcw run --project P1 --job J8 --master_hostname gerp-qcif-node00 --master_command_core_port 39002 > /mnt/userdata/jvanschy/P1/J8/job.log 2>&1 
==========================================================================
==========================================================================

-------- Submission command: 
sbatch /mnt/userdata/jvanschy/P1/J8/queue_sub_script.sh

-------- Cluster Job ID: 
30

-------- Queued on cluster at 2022-04-08 04:53:43.476672

-------- Job status at 2022-04-08 04:53:43.516076
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                30     batch cryospar jvanschy PD       0:00      1 (None)

[CPU: 80.2 MB]   Project P1 Job J8 Started

[CPU: 80.2 MB]   Project P1 Job J8 Started

[CPU: 80.2 MB]   Master running v3.3.1, worker running v3.3.1

[CPU: 80.2 MB]   Master running v3.3.1, worker running v3.3.1

[CPU: 80.6 MB]   Working in directory: /mnt/userdata/jvanschy/P1/J8

[CPU: 80.6 MB]   Running on lane debu

[CPU: 80.6 MB]   Resources allocated: 

[CPU: 80.6 MB]     Worker:  debu

[CPU: 80.6 MB]   Working in directory: /mnt/userdata/jvanschy/P1/J8

[CPU: 80.6 MB]     CPU   :  [0, 1]

[CPU: 80.6 MB]     GPU   :  [0]

[CPU: 80.6 MB]   Running on lane debu

[CPU: 80.6 MB]     RAM   :  [0]

[CPU: 80.6 MB]   Resources allocated: 

[CPU: 80.6 MB]     SSD   :  False

[CPU: 80.6 MB]     Worker:  debu

[CPU: 80.6 MB]   --------------------------------------------------------------

[CPU: 80.6 MB]     CPU   :  [0, 1]

[CPU: 80.6 MB]   Importing job module for job type patch_ctf_estimation_multi...

[CPU: 80.6 MB]     GPU   :  [0]

[CPU: 80.6 MB]     RAM   :  [0]

[CPU: 80.6 MB]     SSD   :  False

[CPU: 80.6 MB]   --------------------------------------------------------------

[CPU: 80.6 MB]   Importing job module for job type patch_ctf_estimation_multi...

[CPU: 211.2 MB]  Job ready to run

[CPU: 211.2 MB]  ***************************************************************

[CPU: 211.6 MB]  Job ready to run

[CPU: 211.6 MB]  ***************************************************************

[CPU: 211.2 MB]  Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 85, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/ctf_estimation/run.py", line 42, in cryosparc_compute.jobs.ctf_estimation.run.run
AttributeError: 'NoneType' object has no attribute 'data'


[CPU: 211.6 MB]  Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 85, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/ctf_estimation/run.py", line 42, in cryosparc_compute.jobs.ctf_estimation.run.run
AttributeError: 'NoneType' object has no attribute 'data'

Any help in resolving this would be greatly appreciated.

Thanks,
Jay.

@ozej8y I would be concerned about the FileExistsError, and I also noticed the srun command inside your submission script.
My (untested) guess: the srun inside the script could explain the FileExistsError. Because the script requests --ntasks=2, srun would launch the cryosparcw command once per task, so two copies of the same job run concurrently and both try to create the same output directory (which would also explain the duplicated log lines in your CTF job).
You may consider reconfiguring your cluster so that srun is no longer included inside your cluster submission script (see the sketch below), which I assume is supplied as an argument to sbatch.
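For reference, here is a minimal sketch of what a corrected Slurm template (cluster_script.sh) could look like. The {{ ... }} placeholder names are assumptions based on CryoSPARC's usual cluster-integration variables, so match them against your site's existing template; the essential change is simply dropping srun, so the batch script runs the worker command exactly once regardless of --ntasks:

#!/usr/bin/env bash
## Sketch of cluster_script.sh -- placeholder names are assumptions; adjust to your template
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --partition=batch
#SBATCH --output={{ job_log_path_abs }}
#SBATCH --error={{ job_log_path_abs }}
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task={{ num_cpu }}
#SBATCH --mem={{ (ram_gb*1000)|int }}M
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --gres-flags=enforce-binding

# Run the cryosparcw worker command directly -- no srun, so only one copy of the job starts.
{{ run_cmd }}

After editing the template (and cluster_info.json if needed), re-running cryosparcm cluster connect from the directory containing both files should register the updated configuration with the master.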

Thanks @wtempel. Solved.