Motion Correction Error: AssertionError

Hello.

CryoSparc 2.15.0, patch level 200728

We are getting the following error during motion correction:

[CPU: 189.2 MB]
Traceback (most recent call last):
File "cryosparc2_worker/cryosparc2_compute/run.py", line 85, in cryosparc2_compute.run.main
File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 359, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi
AssertionError: Child process with PID 1837234 has terminated unexpectedly!

Hi @yodamoppet,

Hmmm, there’s not a lot of information about why the job failed… Try checking the job’s log for more detailed output. From the command line, run “cryosparcm joblog Px Jy” where x and y are replaced with the job’s project number and job number respectively (example: “cryosparcm joblog P32 J114”).

Is this a brand new installation, or was it previously working?

Harris

This is a new system, new installation. We have run cryosparc on other systems in the past though, so we are familiar with the installation and configuration.

Here is the output from the job log command. The last traceback repeats many times, so I have truncated it here to save space.

================= CRYOSPARCW =======  2020-09-16 11:13:54.823956  =========
Project P2 Job J3
Master vision.structbio.pitt.edu Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 698874
========= monitor process now waiting for main process
MAIN PID 698874
motioncorrection.run_patch cryosparc2_compute.jobs.jobregister
***************************************************************
Running job on hostname %s vision
Allocated Resources :  {u'lane': u'vision', u'target': {u'lane': u'vision', u'qdel_cmd_tpl': u'scancel {{ cluster_job_id }}', u'name': u'vision', u'title': u'vision', u'hostname': u'vision', u'qstat_cmd_tpl': u'squeue -j {{ cluster_job_id }}', u'worker_bin_path': u'/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw', u'qinfo_cmd_tpl': u'sinfo', u'qsub_cmd_tpl': u'sbatch {{ script_path_abs }}', u'cache_path': u'/local', u'cache_quota_mb': None, u'script_tpl': u'#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. \n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ 
num_gpu }}\n#SBATCH -p defq\n#SBATCH --mem={{ (ram_gb*1000)|int }}MB             \n#SBATCH -o {{ job_dir_abs }}/out.txt\n#SBATCH -e {{ job_dir_abs }}/err.txt\n\navailable_devs=""\nfor devidx in $(seq 0 15);\ndo\n    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n        if [[ -z "$available_devs" ]] ; then\n            available_devs=$devidx\n        else\n            available_devs=$available_devs,$devidx\n        fi\n    fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n', u'cache_reserve_mb': 10000, u'type': u'cluster', u'send_cmd_tpl': u'{{ command }}', u'desc': None}, u'license': True, u'hostname': u'vision', u'slots': {u'GPU': [0, 1], u'RAM': [0, 1, 2, 3], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}, u'fixed': {u'SSD': False}, u'lane_type': u'vision', u'licenses_acquired': 2}
**** handle exception rc
set status to failed
Traceback (most recent call last):
  File "cryosparc2_worker/cryosparc2_compute/run.py", line 85, in cryosparc2_compute.run.main
  File "cryosparc2_master/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 52, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 17] File exists: '/tank/conwaylab/conway/cryosparc/2018-03-30_CChen-PTang-F34_PolF3MmCcC230np115kx/P2/J3/thumbnails'
========= main process now complete.
========= monitor process now complete.
tail: /tank/conwaylab/conway/cryosparc/2018-03-30_CChen-PTang-F34_PolF3MmCcC230np115kx/P2/J3/job.log: file truncated


================= CRYOSPARCW =======  2020-09-16 11:14:18.004313  =========
Project P2 Job J3
Master vision.structbio.pitt.edu Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 699436
========= monitor process now waiting for main process
MAIN PID 699436
motioncorrection.run_patch cryosparc2_compute.jobs.jobregister
***************************************************************
Running job on hostname %s vision
Allocated Resources :  {u'lane': u'vision', u'target': {u'lane': u'vision', u'qdel_cmd_tpl': u'scancel {{ cluster_job_id }}', u'name': u'vision', u'title': u'vision', u'hostname': u'vision', u'qstat_cmd_tpl': u'squeue -j {{ cluster_job_id }}', u'worker_bin_path': u'/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw', u'qinfo_cmd_tpl': u'sinfo', u'qsub_cmd_tpl': u'sbatch {{ script_path_abs }}', u'cache_path': u'/local', u'cache_quota_mb': None, u'script_tpl': u'#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. \n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ 
num_gpu }}\n#SBATCH -p defq\n#SBATCH --mem={{ (ram_gb*1000)|int }}MB             \n#SBATCH -o {{ job_dir_abs }}/out.txt\n#SBATCH -e {{ job_dir_abs }}/err.txt\n\navailable_devs=""\nfor devidx in $(seq 0 15);\ndo\n    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n        if [[ -z "$available_devs" ]] ; then\n            available_devs=$devidx\n        else\n            available_devs=$available_devs,$devidx\n        fi\n    fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n', u'cache_reserve_mb': 10000, u'type': u'cluster', u'send_cmd_tpl': u'{{ command }}', u'desc': None}, u'license': True, u'hostname': u'vision', u'slots': {u'GPU': [0, 1], u'RAM': [0, 1, 2, 3], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}, u'fixed': {u'SSD': False}, u'lane_type': u'vision', u'licenses_acquired': 2}
Process Process-1:2:
Traceback (most recent call last):
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "cryosparc2_compute/jobs/pipeline.py", line 155, in process_work_simple
    process_setup(proc_idx) # do any setup you want on a per-process basis
  File "cryosparc2_master/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 80, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.process_setup
  File "cryosparc2_compute/engine/__init__.py", line 8, in <module>
    from engine import *
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 4, in init cryosparc2_compute.engine.engine
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/pycuda/driver.py", line 62, in <module>
    from pycuda._driver import *  # noqa
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Process Process-1:1:
Traceback (most recent call last):
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "cryosparc2_compute/jobs/pipeline.py", line 155, in process_work_simple
    process_setup(proc_idx) # do any setup you want on a per-process basis
  File "cryosparc2_master/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 80, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.process_setup
  File "cryosparc2_compute/engine/__init__.py", line 8, in <module>
    from engine import *
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 4, in init cryosparc2_compute.engine.engine
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/pycuda/driver.py", line 62, in <module>
    from pycuda._driver import *  # noqa
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):

We’ve tried running a bit interactively, and I get a slightly different and perhaps more informative error. This is CUDA 10.2 – should I try an earlier CUDA or is something else going on here?

from . import cublas
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/misc.py", line 25, in <module>
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py", line 292, in <module>
    _cublas_version = int(_get_cublas_version())
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py", line 285, in _get_cublas_version
    h = cublasCreate()
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py", line 203, in cublasCreate
    cublasCheckStatus(status)
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py", line 179, in cublasCheckStatus
    raise e
cublasNotInitialized

Interesting. My first thought was that perhaps the installation hadn’t been successful (or the CUDA paths weren’t set up correctly), but it sounds like you have a good handle on that. Another possibility is that your GPUs are set to process-exclusive mode. Have a look at the response by my colleague Stephan in this thread: “LogicError: cuCtxCreate failed: invalid device ordinal”.
(You shouldn’t need to downgrade your CUDA version, though, as recommended in that other thread.) Try the nvidia-smi command from that thread and see if that resolves it.

Harris

Thanks for the reply.

The compute mode is set to “default” on all nodes, which should be correct. It is set as “default” on our other systems that work with cryosparc as well…

nvidia-smi --query | grep 'Compute Mode'
Compute Mode : Default
Compute Mode : Default
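
For anyone who wants to script the same compute-mode check across nodes, here is a minimal sketch. It assumes nvidia-smi is on the PATH and that the `compute_mode` field of `--query-gpu` prints values such as `Default` or `Exclusive_Process`; the function names are made up for illustration and are not part of cryoSPARC.

```python
import subprocess


def parse_compute_modes(text):
    """Return one compute-mode string per non-empty line of nvidia-smi CSV output."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def non_default_gpus(modes):
    """Return (gpu_index, mode) pairs for GPUs whose compute mode is not 'Default'."""
    return [(i, mode) for i, mode in enumerate(modes) if mode != "Default"]


def query_compute_modes():
    """Ask nvidia-smi for each GPU's compute mode (needs the NVIDIA driver tools installed)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=compute_mode", "--format=csv,noheader"]
    )
    return parse_compute_modes(out.decode())
```

Calling `non_default_gpus(query_compute_modes())` on a worker node should return an empty list when every GPU is in Default mode, which would match the output above.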

You mentioned the CUDA Toolkit path, and I see that that is indeed correct in the cryosparc2_worker/config.sh file.

I also noticed that the other thread you pointed me to mentioned checking the location of the cuBLAS libraries, and I do find those in lib64 under my toolkit path (the same path that is set in the config.sh file):

libcublasLt.so -> libcublasLt.so.10
libcublasLt.so.10 -> libcublasLt.so.10.2.2.89
libcublasLt.so.10.2 -> libcublasLt.so.10.2.2.89
libcublasLt.so.10.2.2.89
libcublasLt_static.a
libcublas.so -> libcublas.so.10
libcublas.so.10 -> libcublas.so.10.2.2.89
libcublas.so.10.2 -> libcublas.so.10.2.2.89
libcublas.so.10.2.2.89
libcublas_static.a

Would really like to get this working – what else can I look at to dig into this problem? Since skcuda is throwing the error, could there be a problem with that package?

Hi @yodamoppet,

During the installation, some of the packages we rely on (PyCUDA, for example) are compiled. If for whatever reason the CUDA path was not available at install time, that could cause symptoms like this. Try the steps listed by my colleague in this thread: “Troubleshooting: T20S extensive workflow patch motion correction failure”.
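
Separately, the job log above shows `ImportError: libcuda.so.1: cannot open shared object file`, so it may be worth confirming that the dynamic loader can find the libraries in the same environment the job runs in (for example, inside a SLURM allocation on the compute node rather than on the login node). Note that libcuda.so.1 is installed by the NVIDIA driver, not by the CUDA toolkit. Below is a rough sketch using only the Python standard library; the helper names are hypothetical, and the library names are simply the ones that appear in this thread.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        # Raised when the library is missing or not on the loader search path
        return False


def check_libraries(names=("libcuda.so.1", "libcublas.so")):
    """Map each library name to whether it could be loaded in this environment."""
    return {name: can_load(name) for name in names}
```

If `check_libraries()` reports `False` for libcuda.so.1 on the node where the job runs, that would point to the driver library being missing or outside the loader path there (LD_LIBRARY_PATH or the ldconfig cache) rather than to a problem with skcuda itself.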

Harris