Motion Correction Error: AssertionError

yodamoppet · September 14, 2020, 4:41pm

Hello.

CryoSparc 2.15.0, patch level 200728

We are getting the following error during motion correction:

[CPU: 189.2 MB]
Traceback (most recent call last):
  File "cryosparc2_worker/cryosparc2_compute/run.py", line 85, in cryosparc2_compute.run.main
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 359, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi
AssertionError: Child process with PID 1837234 has terminated unexpectedly!

hsnyder · September 16, 2020, 3:10pm

Hi @yodamoppet,

Hmm, that error alone doesn’t say much about why the job failed. Try checking the job’s log for more detailed output. From the command line, run "cryosparcm joblog Px Jy", where x and y are the job’s project number and job number respectively (for example, "cryosparcm joblog P32 J114").

Is this a brand new installation, or was it previously working?

Harris

yodamoppet · September 16, 2020, 3:16pm

This is a new system, new installation. We have run cryosparc on other systems in the past though, so we are familiar with the installation and configuration.

Here is the output from the job log command. The last traceback repeats many times, so I have truncated it here to save space.

================= CRYOSPARCW =======  2020-09-16 11:13:54.823956  =========
Project P2 Job J3
Master vision.structbio.pitt.edu Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 698874
========= monitor process now waiting for main process
MAIN PID 698874
motioncorrection.run_patch cryosparc2_compute.jobs.jobregister
***************************************************************
Running job on hostname %s vision
Allocated Resources :  {u'lane': u'vision', u'target': {u'lane': u'vision', u'qdel_cmd_tpl': u'scancel {{ cluster_job_id }}', u'name': u'vision', u'title': u'vision', u'hostname': u'vision', u'qstat_cmd_tpl': u'squeue -j {{ cluster_job_id }}', u'worker_bin_path': u'/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw', u'qinfo_cmd_tpl': u'sinfo', u'qsub_cmd_tpl': u'sbatch {{ script_path_abs }}', u'cache_path': u'/local', u'cache_quota_mb': None, u'script_tpl': u'#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. \n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ 
num_gpu }}\n#SBATCH -p defq\n#SBATCH --mem={{ (ram_gb*1000)|int }}MB             \n#SBATCH -o {{ job_dir_abs }}/out.txt\n#SBATCH -e {{ job_dir_abs }}/err.txt\n\navailable_devs=""\nfor devidx in $(seq 0 15);\ndo\n    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n        if [[ -z "$available_devs" ]] ; then\n            available_devs=$devidx\n        else\n            available_devs=$available_devs,$devidx\n        fi\n    fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n', u'cache_reserve_mb': 10000, u'type': u'cluster', u'send_cmd_tpl': u'{{ command }}', u'desc': None}, u'license': True, u'hostname': u'vision', u'slots': {u'GPU': [0, 1], u'RAM': [0, 1, 2, 3], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}, u'fixed': {u'SSD': False}, u'lane_type': u'vision', u'licenses_acquired': 2}
**** handle exception rc
set status to failed
Traceback (most recent call last):
  File "cryosparc2_worker/cryosparc2_compute/run.py", line 85, in cryosparc2_compute.run.main
  File "cryosparc2_master/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 52, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 17] File exists: '/tank/conwaylab/conway/cryosparc/2018-03-30_CChen-PTang-F34_PolF3MmCcC230np115kx/P2/J3/thumbnails'
========= main process now complete.
========= monitor process now complete.
tail: /tank/conwaylab/conway/cryosparc/2018-03-30_CChen-PTang-F34_PolF3MmCcC230np115kx/P2/J3/job.log: file truncated


================= CRYOSPARCW =======  2020-09-16 11:14:18.004313  =========
Project P2 Job J3
Master vision.structbio.pitt.edu Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 699436
========= monitor process now waiting for main process
MAIN PID 699436
motioncorrection.run_patch cryosparc2_compute.jobs.jobregister
***************************************************************
Running job on hostname %s vision
Allocated Resources :  {u'lane': u'vision', u'target': {u'lane': u'vision', u'qdel_cmd_tpl': u'scancel {{ cluster_job_id }}', u'name': u'vision', u'title': u'vision', u'hostname': u'vision', u'qstat_cmd_tpl': u'squeue -j {{ cluster_job_id }}', u'worker_bin_path': u'/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw', u'qinfo_cmd_tpl': u'sinfo', u'qsub_cmd_tpl': u'sbatch {{ script_path_abs }}', u'cache_path': u'/local', u'cache_quota_mb': None, u'script_tpl': u'#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. \n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ 
num_gpu }}\n#SBATCH -p defq\n#SBATCH --mem={{ (ram_gb*1000)|int }}MB             \n#SBATCH -o {{ job_dir_abs }}/out.txt\n#SBATCH -e {{ job_dir_abs }}/err.txt\n\navailable_devs=""\nfor devidx in $(seq 0 15);\ndo\n    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n        if [[ -z "$available_devs" ]] ; then\n            available_devs=$devidx\n        else\n            available_devs=$available_devs,$devidx\n        fi\n    fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n', u'cache_reserve_mb': 10000, u'type': u'cluster', u'send_cmd_tpl': u'{{ command }}', u'desc': None}, u'license': True, u'hostname': u'vision', u'slots': {u'GPU': [0, 1], u'RAM': [0, 1, 2, 3], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}, u'fixed': {u'SSD': False}, u'lane_type': u'vision', u'licenses_acquired': 2}
Process Process-1:2:
Traceback (most recent call last):
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "cryosparc2_compute/jobs/pipeline.py", line 155, in process_work_simple
    process_setup(proc_idx) # do any setup you want on a per-process basis
  File "cryosparc2_master/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 80, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.process_setup
  File "cryosparc2_compute/engine/__init__.py", line 8, in <module>
    from engine import *
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 4, in init cryosparc2_compute.engine.engine
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/pycuda/driver.py", line 62, in <module>
    from pycuda._driver import *  # noqa
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Process Process-1:1:
Traceback (most recent call last):
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "cryosparc2_compute/jobs/pipeline.py", line 155, in process_work_simple
    process_setup(proc_idx) # do any setup you want on a per-process basis
  File "cryosparc2_master/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 80, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.process_setup
  File "cryosparc2_compute/engine/__init__.py", line 8, in <module>
    from engine import *
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 4, in init cryosparc2_compute.engine.engine
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/pycuda/driver.py", line 62, in <module>
    from pycuda._driver import *  # noqa
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
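
A log like the one above repeats the same traceback many times, which makes the first real failure easy to miss. The following is a minimal sketch, not part of cryoSPARC, that collapses a job log into its distinct exception lines with counts. The function name summarize_exceptions and the sample text are hypothetical, and the pattern assumes exception names end in "Error" or "Exception".

```python
import re
from collections import Counter

# Final line of a Python traceback, e.g. "ImportError: libcuda.so.1: ...".
# Assumption: exception names end in "Error" or "Exception"; others are missed.
EXC_LINE = re.compile(r"^(\w+(?:Error|Exception)): (.*)$")


def summarize_exceptions(log_text):
    """Return (name, message, count) for each distinct exception line,
    ordered by first appearance in the log."""
    counts = Counter()
    order = []
    for line in log_text.splitlines():
        match = EXC_LINE.match(line.strip())
        if match:
            key = (match.group(1), match.group(2))
            if key not in counts:
                order.append(key)
            counts[key] += 1
    return [(name, msg, counts[(name, msg)]) for name, msg in order]


# Illustrative excerpt modelled on the log in this thread.
SAMPLE_LOG = """\
Process Process-1:2:
Traceback (most recent call last):
    from pycuda._driver import *  # noqa
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Process Process-1:1:
Traceback (most recent call last):
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
IOError: [Errno 32] Broken pipe
IOError: [Errno 32] Broken pipe
IOError: [Errno 32] Broken pipe
"""

for name, msg, count in summarize_exceptions(SAMPLE_LOG):
    print("%3dx %s: %s" % (count, name, msg))
```

Read this way, the ImportError for libcuda.so.1 appears first, and the later "Broken pipe" errors are most likely a consequence of the child processes having already exited.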

yodamoppet · September 17, 2020, 12:20pm

We’ve tried running a bit interactively, and I get a slightly different and perhaps more informative error. This is CUDA 10.2 – should I try an earlier CUDA or is something else going on here?

from . import cublas
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/misc.py", line 25, in <module>
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py", line 292, in <module>
    _cublas_version = int(_get_cublas_version())
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py", line 285, in _get_cublas_version
    h = cublasCreate()
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py", line 203, in cublasCreate
    cublasCheckStatus(status)
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py", line 179, in cublasCheckStatus
    raise e
cublasNotInitialized

hsnyder · September 17, 2020, 5:03pm

Interesting. My first thought was that perhaps the installation hadn’t been successful (or the CUDA paths weren’t set up correctly), but it sounds like you have a good handle on that. Another possibility is that your GPUs are set to process-exclusive compute mode. Have a look at the response by my colleague Stephan in this thread: "LogicError: cuCtxCreate failed: invalid device ordinal".
(You shouldn’t need to downgrade your CUDA version, though, as was recommended in that thread.) Try the nvidia-smi command from that response and see if it resolves the problem.

Harris

yodamoppet · September 17, 2020, 6:59pm

Thanks for the reply.

The compute mode is set to “default” on all nodes, which should be correct. It is set as “default” on our other systems that work with cryosparc as well…

nvidia-smi --query | grep 'Compute Mode'
Compute Mode : Default
Compute Mode : Default
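
For clusters with many nodes, the same check can be scripted. This is a rough sketch that only parses the text printed by the "nvidia-smi --query" command used above. The function names are hypothetical, and the sample string (including the "Exclusive_Process" value) is illustrative rather than captured from this system.

```python
import subprocess


def parse_compute_modes(query_output):
    """Return the value of every 'Compute Mode' line in `nvidia-smi --query` text."""
    modes = []
    for line in query_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Compute Mode":
            modes.append(value.strip())
    return modes


def gpus_not_in_default_mode(query_output):
    """Positions (in output order) of GPUs whose compute mode is not 'Default'."""
    modes = parse_compute_modes(query_output)
    return [i for i, mode in enumerate(modes) if mode != "Default"]


def read_nvidia_smi():
    """Run the command from this thread; requires nvidia-smi on PATH."""
    return subprocess.check_output(["nvidia-smi", "--query"]).decode()


# Illustrative input only; on a GPU node, pass read_nvidia_smi() instead.
SAMPLE = "Compute Mode : Default\nCompute Mode : Exclusive_Process\n"
print(parse_compute_modes(SAMPLE))
print(gpus_not_in_default_mode(SAMPLE))
```

An empty list from gpus_not_in_default_mode means every GPU reported "Default", which matches the output shown above.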

You mentioned the CUDA Toolkit path, and I see that that is indeed correct in the cryosparc2_worker/config.sh file.

I also noticed that the other thread you pointed me to mentioned checking the location of the cuBLAS libraries, and I do find those in lib64 under my toolkit path (the same path that is set in the config.sh file):

libcublasLt.so -> libcublasLt.so.10
libcublasLt.so.10 -> libcublasLt.so.10.2.2.89
libcublasLt.so.10.2 -> libcublasLt.so.10.2.2.89
libcublasLt.so.10.2.2.89
libcublasLt_static.a
libcublas.so -> libcublas.so.10
libcublas.so.10 -> libcublas.so.10.2.2.89
libcublas.so.10.2 -> libcublas.so.10.2.2.89
libcublas.so.10.2.2.89
libcublas_static.a
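
Finding the files on disk does not by itself show that the dynamic loader can open them from the environment the job runs in. Note also that libcuda.so.1, the library named in the earlier ImportError, is installed by the NVIDIA driver rather than by the toolkit, so it will not appear in the toolkit's lib64. The sketch below is one way to test loading directly with Python's ctypes. The helper name can_load is hypothetical, and the result is only meaningful when run on a compute node with the same environment variables as the worker.

```python
import ctypes


def can_load(library_name):
    """Return (True, None) if the dynamic loader can open the library,
    otherwise (False, reason)."""
    try:
        ctypes.CDLL(library_name)
        return True, None
    except OSError as exc:
        return False, str(exc)


# Library names taken from the error message and the listing in this thread.
for name in ("libcuda.so.1", "libcublas.so.10"):
    ok, reason = can_load(name)
    print("%-18s %s" % (name, "OK" if ok else "FAILED: %s" % reason))
```

If libcuda.so.1 fails to load on a compute node but loads on the login node, that would point at the driver installation or library search path on that node rather than at cryoSPARC itself.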

Would really like to get this working – what else can I look at to dig into this problem? Since skcuda is throwing the error, could there be a problem with that package?

hsnyder · September 18, 2020, 4:19pm

Hi @yodamoppet,

During the installation, some of the packages we rely on (PyCUDA, for example) get compiled. If for whatever reason the CUDA path was not available at install time, that could cause symptoms like this. Try the steps listed by my colleague in this thread: "Troubleshooting: T20S extensive workflow patch motion correction failure".

Harris

yodamoppet · September 25, 2020, 12:13pm

Hi @hsnyder, appreciate the guidance.

I tried the steps you outlined, and it seems to build fine with no errors (though I’m still getting a process error, see below):

******* CRYOSPARC SYSTEM: WORKER INSTALLER ***********************

 Installation Settings:
   Root Directory          : /opt/cryoem/cryosparc/cryosparc2_worker
   Standalone Installation : false
   Version                 : v2.15.0

******************************************************************

 CUDA check..
 Found nvidia-smi at /cm/local/apps/cuda/libs/current/bin/nvidia-smi

 CUDA Path was provided as /cm/shared/apps/cuda10.2/toolkit/10.2.89
 Checking CUDA installation...
 Found nvcc at /cm/shared/apps/cuda10.2/toolkit/10.2.89/bin/nvcc
 The above cuda installation will be used but can be changed later.

******************************************************************

 Setting up hard-coded config.sh environment variables

******************************************************************

 Installing all dependencies.

Checking dependencies... 
Dependencies for python have not changed.
Currently checking hash for ctffind
Dependencies for ctffind have not changed.
Currently checking hash for gctf
Dependencies for gctf have not changed.
Completed dependency check. 

******* CRYOSPARC WORKER INSTALLATION COMPLETE *******************

 In order to run processing jobs, you will need to connect this
 worker to a cryoSPARC master.

******************************************************************

I no longer get the same error when running the job, so progress! However, I’m now getting the following error:

min: -99541.428192 max: 99428.298370
min: -282101.059074 max: 281907.378426
min: -102334.039009 max: 102316.375053
**custom thread exception hook caught something
**** handle exception rc
set status to failed
Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 1678, in run_with_except_hook
    run_old(*args, **kw)
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "cryosparc2_compute/jobs/pipeline.py", line 165, in thread_work
    work = processor.process(item)
  File "cryosparc2_master/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 157, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correc$
  File "cryosparc2_master/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 160, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correc$
  File "cryosparc2_master/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 164, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correc$
  File "cryosparc2_master/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 84, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_correc$
  File "cryosparc2_master/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 185, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_corre$
  File "cryosparc2_compute/jobs/motioncorrection/mic_utils.py", line 96, in replace_hot_mov
    mov[:,hotmask] = n.random.poisson(avg_dose_per_frame, size=(N_Z, numhot)).astype(mov.dtype)
  File "mtrand.pyx", line 4188, in mtrand.RandomState.poisson (numpy/random/mtrand/mtrand.c:29325)
ValueError: lam < 0

hsnyder · September 25, 2020, 7:46pm

Hi @yodamoppet,

Is it possible that you entered a negative or zero value for total electron dose in the import movies stage? Or that the movie being processed has 1 or 0 frames? The "lam < 0" error means the Poisson rate passed to NumPy was negative, so this is probably an issue with the parameters or the input data. Those would be my best guesses.

Harris
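
The two guesses above can be turned into a quick pre-flight check on the import parameters. This is only a sketch under stated assumptions: the helper name dose_input_problems and its arguments are hypothetical, and it does not reproduce how cryoSPARC actually computes avg_dose_per_frame.

```python
def dose_input_problems(total_dose, n_frames):
    """List reasons the given inputs could lead to an invalid Poisson rate.

    Hypothetical helper for illustration; the checks mirror the two guesses
    in this thread, not cryoSPARC's internal logic.
    """
    problems = []
    if n_frames < 2:
        problems.append("movie has %d frame(s); expected at least 2" % n_frames)
    if total_dose <= 0:
        problems.append("total dose %r is not positive" % (total_dose,))
    return problems


# Example values are made up for illustration.
for dose, frames in [(50.0, 40), (-50.0, 40), (50.0, 1)]:
    issues = dose_input_problems(dose, frames)
    print(dose, frames, issues if issues else "looks plausible")
```

An empty list only rules out these two causes; a movie whose pixel data is itself unusual could still trigger the same error.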

yodamoppet · September 28, 2020, 2:50pm

Hi @hsnyder,

We tried a different dataset, and it worked! Thanks for all your help sorting this out and getting us going with this system.

Doug