Patch motion correction and MotionCor2 failed

Hi all!
I am trying to run patch motion correction on cryoSPARC v3.2.0 with CUDA 10.1. I'm running it on a cluster, using GPU nodes (32 cores with HT, 3.2 GHz Xeon E5-2667 v4 CPUs, GeForce GTX 1080 Ti GPUs with 11 GB, 256 GB RAM), but I have also tried CPU nodes (Dell PE M630, 24 cores with HT, 3.2 GHz Xeon E5-2667 v3, 64 GB RAM).

I get the following error:

License is valid.

Launching job on lane CPU-SGE target CPU-SGE ...

Launching job on cluster CPU-SGE


====================== Cluster submission script: ========================
==========================================================================
#!/bin/sh
#$ -V
#$ -N cryosparc_P1_J23
#$ -pe openmpi 1 -l dedicated=24 -A Cryosparc
#$ -e P1/J23/J23.err
#$ -o P1/J23/J23.out
#$ -cwd
#$ -S /bin/bash

export CUDA_VISIBLE_DEVICES=""

soft/cryosparc2/cryosparc_worker/bin/cryosparcw run --project P1 --job J23 --master_hostname hal.lmb.internal --master_command_core_port 39042 > P1/J23/job.log 2>&1 

==========================================================================
==========================================================================

-------- Submission command: 
qsub /P1/J23/queue_sub_script.sh

-------- Cluster Job ID: 
submitted

-------- Queued on cluster at 2021-04-28 18:56:22.750650

Failed to check cluster job status! 1

[CPU: 68.1 MB]   Project P1 Job J23 Started

[CPU: 68.1 MB]   Master running v3.2.0+210413, worker running v3.2.0+210413

[CPU: 68.4 MB]   Running on lane CPU-SGE

[CPU: 68.4 MB]   Resources allocated: 

[CPU: 68.4 MB]     Worker:  CPU-SGE

[CPU: 68.4 MB]     CPU   :  [0, 1, 2, 3, 4, 5]

[CPU: 68.4 MB]     GPU   :  [0]

[CPU: 68.4 MB]     RAM   :  [0, 1]

[CPU: 68.4 MB]     SSD   :  False

[CPU: 68.4 MB]   --------------------------------------------------------------

[CPU: 68.4 MB]   Importing job module for job type patch_motion_correction_multi...

[CPU: 206.5 MB]  Job ready to run

[CPU: 206.5 MB]  ***************************************************************

[CPU: 206.8 MB]  Job will process this many movies:  1079

[CPU: 206.9 MB]  parent process is 110258

[CPU: 163.3 MB]  Calling CUDA init from 110289

[CPU: 209.7 MB]  Outputting partial results now...
[CPU: 210.7 MB]  Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 402, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi
AssertionError: Child process with PID 110289 has terminated unexpectedly!

The CUDA version is 10.1, but I have also tried 9.1 with the same result.


I have also tried to run MotionCor2, which fails with this error:

[CPU: 85.9 MB]   --------------------------------------------------------------

[CPU: 85.9 MB]   Processed 0 of 1079 movies in 0.01s 

[CPU: 85.9 MB]   Raw movie filepath located at: J4/imported/FoilHole_5452167_Data_4501361_4501363_20210406_011954_Fractions.mrcs - creating MotionCor2 command string...

[CPU: 1.30 GB]   Finished creating MotionCor2 command string in 14.84s

[CPU: 1.30 GB]   Starting MotionCor2 process...

[CPU: 1.30 GB]   Running MotionCor2 command: /public/EM/MOTIONCOR2/MotionCor2 -InMrc /P1/J4/imported/FoilHole_5452167_Data_4501361_4501363_20210406_011954_Fractions.mrcs -OutMrc /P1/J21/motioncorrected/012542470421558195703_FoilHole_5452167_Data_4501361_4501363_20210406_011954_Fractions_motioncor2_aligned.mrc -Patch 5.0 5.0 -Kv 300.0 -PixSize 1.52 -FmDose 0.06974359047718537 -Gpu 0 -GpuMemUsage 0.5 -LogFile /P1/J21/motioncor2_logs/0

[CPU: 1.30 GB]   Running process 3251606

[CPU: 1.30 GB]   ERROR motioncor2 failed to produce output file /P1/J21/motioncorrected/012542470421558195703_FoilHole_5452167_Data_4501361_4501363_20210406_011954_Fractions_motioncor2_aligned.mrc

[CPU: 1.30 GB]   Finished MotionCor2 process in 0.33s
[CPU: 1.30 GB]   Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "soft/cryosparc2/cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_motioncor2.py", line 364, in run_motioncor2_wrapper
    with open(output_path_abs) as mrc_file:
FileNotFoundError: [Errno 2] No such file or directory: '/P1/J21/motioncorrected/012542470421558195703_FoilHole_5452167_Data_4501361_4501363_20210406_011954_Fractions_motioncor2_aligned.mrc'

I am using the same workspace for all the jobs.
Can anyone help me??
Thanks,
Irene

Hi Irene, is this a new cryoSPARC installation, or have you been using it successfully for a while?

Thanks,
-Harris

Hi Harris!
This is a new installation; I've never used cryoSPARC before. So far, importing my movies has worked.
Thanks,
Irene

Hi Irene,

Usually when this happens it's because of an issue with the installation and CUDA. Motion correction is the first job in the processing pipeline that uses GPUs, so that's where the problem shows up if something isn't working with CUDA or if the install didn't complete correctly. My colleague @stephan will be able to help.
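
In the meantime, a quick smoke test on one of the GPU nodes can tell you whether pycuda can load the driver at all. A minimal sketch, assuming the default worker location from your logs (and that your version has the cryosparcw call helper for running commands inside the worker environment):

cd ~/soft/cryosparc2/cryosparc_worker
./bin/cryosparcw call python -c "import pycuda.driver as cu; cu.init(); print(cu.Device.count(), 'GPU(s) visible')"

If this raises the same ImportError about libcuda.so.1, that node can't see the NVIDIA driver library, and no GPU job will run there.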

–Harris

Hi @idiaz,

Can you send us the job log for both jobs? To get the job log, run the command: cryosparcm joblog P1 J23 in a shell. You can press CTRL+C to exit from “trailing” mode and enable scrolling.
Can you also send us the output of nvidia-smi on the workstation?

Hi Stephan,
The job log for patch motion cor is:

 ================= CRYOSPARCW =======  2021-04-28 18:56:24.813989  =========
Project P1 Job J23
Master hal.lmb.internal Port 39042
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 110258
MAIN PID 110258
motioncorrection.run_patch cryosparc_compute.jobs.jobregister
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "cryosparc_worker/cryosparc_compute/run.py", line 172, in cryosparc_compute.run.run
  File "/lmb/home/idiaz/soft/cryosparc2/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1904, in get_gpu_info
    import pycuda.driver as cudrv
  File "/lmb/home/idiaz/soft/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/driver.py", line 62, in <module>
    from pycuda._driver import *  # noqa
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
***************************************************************
Running job on hostname %s CPU-SGE
Allocated Resources :  {'fixed': {'SSD': False}, 'hostname': 'CPU-SGE', 'lane': 'CPU-SGE', 'lane_type': 'CPU-SGE', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3, 4, 5], 'GPU': [0], 'RAM': [0, 1]}, 'target': {'cache_path': '', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': 'CPU nodes (24 cores, 64Gb RAM)', 'hostname': 'CPU-SGE', 'lane': 'CPU-SGE', 'name': 'CPU-SGE', 'qdel_cmd_tpl': 'qdel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'qsummary', 'qstat_cmd_tpl': 'qstat -j {{ cluster_job_id }}', 'qsub_cmd_tpl': 'qsub {{ script_path_abs }}', 'script_tpl': '#!/bin/sh\n#$ -V\n#$ -N cryosparc_{{ project_uid }}_{{ job_uid }}\n#$ -pe openmpi 1 -l dedicated=24 -A Cryosparc\n#$ -e {{ job_dir_abs }}/{{ job_uid }}.err\n#$ -o {{ job_dir_abs }}/{{ job_uid }}.out\n#$ -cwd\n#$ -S /bin/bash\n\nexport CUDA_VISIBLE_DEVICES=""\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'CPU nodes (SGE)', 'type': 'cluster', 'worker_bin_path': '/lmb/home/idiaz/soft/cryosparc2/cryosparc_worker/bin/cryosparcw'}}
Process Process-1:1:
Traceback (most recent call last):
  File "/lmb/home/idiaz/soft/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/lmb/home/idiaz/soft/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/lmb/home/idiaz/soft/cryosparc2/cryosparc_worker/cryosparc_compute/jobs/pipeline.py", line 176, in process_work_simple
    process_setup(proc_idx) # do any setup you want on a per-process basis
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 81, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.process_setup
  File "/lmb/home/idiaz/soft/cryosparc2/cryosparc_worker/cryosparc_compute/engine/__init__.py", line 8, in <module>
    from .engine import *  # noqa
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 9, in init cryosparc_compute.engine.engine
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 4, in init cryosparc_compute.engine.cuda_core
  File "/lmb/home/idiaz/soft/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/driver.py", line 62, in <module>
    from pycuda._driver import *  # noqa
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
**** handle exception rc
set status to failed

The joblog for the Motioncor2 is:

Lmod has detected the following error: The following module(s) are unknown:
"cuda/10.1.168"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore-cache load "cuda/10.1.168"

Also make sure that all modulefiles written in TCL start with the string
#%Module



/public/EM/MOTIONCOR2//MotionCor2_v1.3.1-Cuda101: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory
[... run of null bytes trimmed ...]
***************************************************************
Running job on hostname %s GPU-SGE
Allocated Resources :  {'fixed': {'SSD': False}, 'hostname': 'GPU-SGE', 'lane': 'GPU-SGE', 'lane_type': 'GPU-SGE', 'license': False, 'licenses_acquired': 0, 'slots': {'CPU': [0], 'GPU': [0], 'RAM': [0]}, 'target': {'cache_path': '/ssd', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': 'GPU nodes (32 cores, 256Gb RAM, 4x GTX 1080 Ti 11Gb)', 'hostname': 'GPU-SGE', 'lane': 'GPU-SGE', 'name': 'GPU-SGE', 'qdel_cmd_tpl': 'qdel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'gqsummary', 'qstat_cmd_tpl': 'qstat -j {{ cluster_job_id }}', 'qsub_cmd_tpl': 'qsub -terse {{ script_path_abs }}', 'script_tpl': '#!/bin/sh\n#$ -V\n#$ -N cryosparc_{{ project_uid }}_{{ job_uid }}\n#$ -pe openmpi 1 -l dedicated=24 -A Cryosparc\n#$ -e {{ job_dir_abs }}/{{ job_uid }}.err\n#$ -o {{ job_dir_abs }}/{{ job_uid }}.out\n#$ -cwd\n#$ -S /bin/bash\n\nexport CUDA_VISIBLE_DEVICES=""\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'GPU nodes (SGE)', 'type': 'cluster', 'worker_bin_path': '/lmb/home/idiaz/soft/cryosparc2/cryosparc_worker/bin/cryosparcw'}}
**** handle exception rc
set status to failed


It does seem to be a problem with CUDA. The nvidia-smi output is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN V             Off  | 00000000:41:00.0 Off |                  N/A |
| 28%   41C    P2    27W / 250W |   1880MiB / 12066MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN V             Off  | 00000000:83:00.0 Off |                  N/A |
| 28%   35C    P8    23W / 250W |     12MiB / 12066MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0   2462771      C   python                                      1862MiB |
+-----------------------------------------------------------------------------+

I thought I was using CUDA 10.1, but it turns out it's 10.2.

Thanks,
Irene

Hey @idiaz,

Definitely seems like it. Are you able to update to the latest CUDA Toolkit version as well as the latest NVIDIA Driver version for your GPUs?
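
One caveat: the "CUDA Version: 10.2" field in nvidia-smi is the maximum version the driver supports, not necessarily the toolkit your install was built against. To see what the worker is actually configured with, and whether the driver library is visible on the node where the job lands, you could try something like this (paths are a guess based on your logs; adjust to your install):

grep CRYOSPARC_CUDA_PATH ~/soft/cryosparc2/cryosparc_worker/config.sh    # toolkit path the worker was installed with
ldconfig -p | grep libcuda    # libcuda.so.1 ships with the NVIDIA driver, not the toolkit

If the second command comes up empty on the CPU-SGE nodes, that alone would explain the ImportError, since nodes without an NVIDIA driver have no libcuda.so.1.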

Hi Stephan,
I don't think I'm able to do that, as it's an institute cluster that I'm using.
But I have the feeling something is wrong with my installation, because colleagues from my lab are able to use cryoSPARC with the same CUDA version.
Thanks,
Irene

Hi @idiaz,

If that’s the case, then do the following on a worker node (a machine with GPUs):

  1. Navigate to cryosparc_worker
  2. Run the command ./bin/cryosparcw newcuda <path to CUDA>

This will uninstall and reinstall pyCUDA, which might fix any problems that came up during the initial installation.
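
For example, a sketch assuming the toolkit lives at /usr/local/cuda-10.1 (a placeholder path; substitute whatever location your colleagues' working installs point at, which module show cuda/10.1.168 on your cluster may reveal):

cd ~/soft/cryosparc2/cryosparc_worker
./bin/cryosparcw newcuda /usr/local/cuda-10.1    # re-points the install and rebuilds pycuda against this toolkit

Building pycuda links against libcuda, so it's safest to run this on a node that has the NVIDIA driver installed.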