Patch motion correction problem

errinaceus · May 31, 2022, 11:00am

Hi everyone, I have just installed CryoSPARC on a new machine and tried to process the tutorial dataset to check that everything works.
I got this traceback:

Launching job on lane default target k306-MS-7522 ...

Running job on master node hostname k306-MS-7522

[CPU: 80.0 MB] Project P1 Job J15 Started

[CPU: 80.0 MB] Master running v3.3.2, worker running v3.3.2

[CPU: 80.4 MB] Working in directory: /media/k306/01D7BF46BBD6EE40/cryosparc/test/P1/J2

[CPU: 80.4 MB] Running on lane default

[CPU: 80.4 MB] Resources allocated:

[CPU: 80.4 MB] Worker: k306-MS-7522

[CPU: 80.4 MB] CPU : [0, 1, 2, 3, 4, 5]

[CPU: 80.4 MB] GPU : [0]

[CPU: 80.4 MB] RAM : [0, 1]

[CPU: 80.4 MB] SSD : False

[CPU: 80.4 MB] --------------------------------------------------------------

[CPU: 80.4 MB] Importing job module for job type patch_motion_correction_multi...

[CPU: 244.6 MB] Job ready to run

[CPU: 244.6 MB] ***************************************************************

[CPU: 245.2 MB] Job will process this many movies: 20

[CPU: 245.5 MB] parent process is 3842

[CPU: 171.2 MB] Calling CUDA init from 3883

[CPU: 246.0 MB] Outputting partial results now...

[CPU: 246.0 MB] Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 85, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 402, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi
AssertionError: Child process with PID 3883 has terminated unexpectedly!

The GPU is a GTX 1080 Ti and the CUDA version was 11.7. I found similar topics on the forum reporting problems with recent CUDA versions, so I installed 11.2 alongside 11.7 and then re-installed CryoSPARC, but the problem persists. Thanks for any ideas.

wtempel · May 31, 2022, 2:07pm

Welcome to the forum @errinaceus.
Please can you share the contents of
/media/k306/01D7BF46BBD6EE40/cryosparc/test/P1/J2/job.log
and provide information on your worker environment:

errinaceus · June 1, 2022, 6:55am

The log file is:

================= CRYOSPARCW =======  2022-05-31 13:01:11.685063  =========
Project P1 Job J2
Master k306-MS-7522 Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 4154
MAIN PID 4154
ctf_estimation.run cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process
***************************************************************
Running job on hostname %s k306-MS-7522
Allocated Resources :  {'fixed': {'SSD': False}, 'hostname': 'k306-MS-7522', 'lane': 'default', 'lane_type': 'default', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1], 'GPU': [0], 'RAM': [0]}, 'target': {'cache_path': '/media/k306/01D7BF46BBD6EE40/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11720261632, 'name': 'NVIDIA GeForce GTX 1080 Ti'}], 'hostname': 'k306-MS-7522', 'lane': 'default', 'monitor_port': None, 'name': 'k306-MS-7522', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7], 'GPU': [0], 'RAM': [0, 1, 2]}, 'ssh_str': 'k306@k306-MS-7522', 'title': 'Worker node k306-MS-7522', 'type': 'node', 'worker_bin_path': '/home/k306/cryosparcuser/cryosparc/cryosparc_worker/bin/cryosparcw'}}
**** handle exception rc
set status to failed
========= main process now complete.
========= monitor process now complete.

The worker is installed on the same machine. Strangely, I get:

~$eval $(/Home/cryosparc_user/cryosparc/cryosparc_worker/bin/cryosparcw env) 
bash: /Home/cryosparc_user/cryosparc/cryosparc_worker/bin/cryosparcw: No such file or directory
~${CRYOSPARC_CUDA_PATH}/bin/nvcc --version
bash: /bin/nvcc: No such file or directory
~$python -c "import pycuda.driver; print(pycuda.driver.get_version())"

Command 'python' not found, did you mean:

  command 'python3' from deb python3
  command 'python' from deb python-is-python3

Does this mean I have something wrong with Python?

Output of uname -a && free -g && nvidia-smi:

Linux k306-MS-7522 5.13.0-44-generic #49~20.04.1-Ubuntu SMP Wed May 18 18:44:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
              total        used        free      shared  buff/cache   available
Mem:             23           2           0           0          20          20
Swap:             1           0           1
Wed Jun  1 09:53:36 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce 1080Ti  On   | 00000000:02:00.0  On |                  N/A |
| 45%   56C    P8    16W / 250W |    156MiB / 11264MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1140      G   /usr/lib/xorg/Xorg                 75MiB |
|    0   N/A  N/A      1448      G   /usr/bin/gnome-shell               79MiB |
+-----------------------------------------------------------------------------+

wtempel · June 1, 2022, 1:54pm

The capitalization of “Home” and a missing k306 component would explain the “No such file or directory” error.

Please can you try
eval $(/home/k306/cryosparcuser/cryosparc/cryosparc_worker/bin/cryosparcw env)
followed by the other commands instead?
Please can you also confirm that the cryoSPARC installation and processes are “owned” by Linux user “k306”:
ls -l /home/k306/cryosparcuser/cryosparc/cryosparc_worker/bin/cryosparcw
ls -l /home/k306/cryosparcuser/cryosparc/cryosparc_master/bin/cryosparcm
ps -ef | grep supervisord
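
The ownership check requested above can also be scripted. The following is a minimal sketch, not part of cryoSPARC: the helper names owner_name and check_owned_by are hypothetical, and it relies only on Python's standard os and pwd modules. A path that does not exist (for example because of a typo such as /Home) is reported as None.

```python
import os
import pwd


def owner_name(path):
    """Return the login name of the file's owner, or the numeric uid as a string."""
    uid = os.stat(path).st_uid
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # The uid has no entry in the password database.
        return str(uid)


def check_owned_by(paths, expected_user):
    """Map each path to True/False (owner matches) or None (path does not exist)."""
    report = {}
    for path in paths:
        if not os.path.exists(path):
            # A mistyped path such as /Home instead of /home ends up here.
            report[path] = None
        else:
            report[path] = owner_name(path) == expected_user
    return report


if __name__ == "__main__":
    base = "/home/k306/cryosparcuser/cryosparc"
    paths = [
        base + "/cryosparc_worker/bin/cryosparcw",
        base + "/cryosparc_master/bin/cryosparcm",
    ]
    for path, ok in check_owned_by(paths, "k306").items():
        print(path, "->", ok)
```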

errinaceus · June 1, 2022, 2:18pm

Thanks, the path was indeed incorrect. I have now taken it from the pwd command, but this time I get no output at all from
eval $(/home/k306/cryosparc_user/cryosparc/cryosparc_worker/bin/cryosparcw env)

I double-checked that the cryosparcw file is in place.

However, now I get:

${CRYOSPARC_CUDA_PATH}/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

As for the process ownership, I have:

ls -l /home/k306/cryosparcuser/cryosparc/cryosparc_worker/bin/cryosparcw
-rwxrwxr-x 1 k306 k306 13677 apr  8 23:44 /home/k306/cryosparcuser/cryosparc/cryosparc_worker/bin/cryosparcw
k306@k306-MS-7522:~/cryosparcuser/cryosparc/cryosparc_master$ ls -l /home/k306/cryosparcuser/cryosparc/cryosparc_master/bin/cryosparcm
-rwxrwxr-x 1 k306 k306 58488 apr  8 23:43 /home/k306/cryosparcuser/cryosparc/cryosparc_master/bin/cryosparcm
k306@k306-MS-7522:~/cryosparcuser/cryosparc/cryosparc_master$ ps -ef | grep supervisord
k306       59893    1015  0 17:09 ?        00:00:00 python /home/k306/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/supervisord -c /home/k306/cryosparcuser/cryosparc/cryosparc_master/supervisord.conf
k306       60309   59763  0 17:15 pts/0    00:00:00 grep --color=auto supervisord
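
To compare toolkit versions programmatically, the release number can be pulled out of the nvcc banner. This is a minimal sketch that uses the banner text quoted in this post; the function name nvcc_release is hypothetical.

```python
import re

# Banner text as quoted in this post.
NVCC_OUTPUT = """nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
"""


def nvcc_release(text):
    """Return (major, minor) from an nvcc --version banner, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


print(nvcc_release(NVCC_OUTPUT))  # (10, 1)
```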

errinaceus · June 1, 2022, 2:21pm

Running cryosparcw env returns this:

export "CRYOSPARC_USE_GPU=true"
export "CRYOSPARC_PATH=/home/k306/cryosparcuser/cryosparc/cryosparc_worker/bin"
export "CRYOSPARC_ROOT_DIR=/home/k306/cryosparcuser/cryosparc/cryosparc_worker"
export "CRYOSPARC_LICENSE_ID=XXXXXXXXXXXXXXXXXXXXX"
export "CRYOSPARC_CUDA_PATH=/usr/local/cuda"
export "CRYOSPARC_DEVELOP=false"
export "CRYOSPARC_CONDA_ENV=cryosparc_worker_env"
export "PATH=/home/k306/cryosparcuser/cryosparc/cryosparc_worker/bin:/home/k306/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/bin:/home/k306/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/condabin:/usr/local/cuda/bin:/home/k306/cryosparcuser/cryosparc/cryosparc_worker/bin:/home/k306/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/bin:/home/k306/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/condabin:/usr/local/cuda/bin:/home/k306/cryosparcuser/cryosparc/cryosparc_worker/bin:/home/k306/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/bin:/home/k306/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/condabin:/usr/local/cuda/bin:/home/k306/cryosparcuser/cryosparc/cryosparc_master/bin:/home/k306/cryosparcuser/cryosparc/cryosparc_master/bin:/home/k306/soft/PyMOL-2.5.2_293-Linux-x86_64-py37/pymol:/usr/local/gromacs/bin/GMXRC:/home/k306/soft/NAMD_3.0alpha11_Linux-x86_64-multicore-CUDA:/home/k306/soft/vmd-1.9.3/bin/vmd:/home/k306/soft/ccpem-1.6.0-linux-x86_64/ccpem-1.6.0/setup_ccpem.sh:/usr/local/phenix-1.20.1-4487/phenix_env.sh:/home/k306/ccp4-8.0.001-shelx-arpwarp-linux64/ccp4-8.0/bin/ccp4.setup-sh:/home/k306/ccp4-8.0.001-shelx-arpwarp-linux64/ccp4-8.0/bin:/home/k306/cryosparcuser/cryosparc/cryosparc_master/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"
export "LD_LIBRARY_PATH=/home/k306/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/blobio:/home/k306/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/libs:/home/k306/cryosparcuser/cryosparc/cryosparc_worker/deps/external/cudnn/lib:/usr/local/cuda/lib64:/home/k306/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/blobio:/home/k306/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/libs:/home/k306/cryosparcuser/cryosparc/cryosparc_worker/deps/external/cudnn/lib:/usr/local/cuda/lib64:/home/k306/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/blobio:/home/k306/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/libs:/home/k306/cryosparcuser/cryosparc/cryosparc_worker/deps/external/cudnn/lib:/usr/local/cuda/lib64"
export "LD_PRELOAD=/home/k306/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/libpython3.7m.so:/home/k306/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/libtiff.so"
export "PYTHONPATH=/home/k306/cryosparcuser/cryosparc/cryosparc_worker"
export "PYTHONNOUSERSITE=true"
export "CONDA_EXE=/home/k306/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/bin/conda"
export "CONDA_PREFIX=/home/k306/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env"
export "CONDA_PROMPT_MODIFIER=(cryosparc_worker_env)"
export "CONDA_SHLVL=1"
export "CONDA_PYTHON_EXE=/home/k306/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/bin/python"
export "CONDA_DEFAULT_ENV=cryosparc_worker_env"
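
Output like this is easier to inspect once it is parsed. Below is a minimal sketch, assuming the `export "NAME=value"` line format shown above; the function names are hypothetical and SAMPLE is a shortened, illustrative stand-in for the real output. It also collapses repeated PATH entries, which appear several times in the output above.

```python
import re

# One line of the form: export "NAME=value"
EXPORT_RE = re.compile(r'^export "([A-Za-z_][A-Za-z0-9_]*)=(.*)"$')


def parse_exports(text):
    """Turn lines of the form export "NAME=value" into a dict."""
    env = {}
    for line in text.splitlines():
        match = EXPORT_RE.match(line.strip())
        if match:
            env[match.group(1)] = match.group(2)
    return env


def unique_entries(path_value):
    """Split a colon-separated path list, keeping the first copy of each entry."""
    seen = []
    for entry in path_value.split(":"):
        if entry and entry not in seen:
            seen.append(entry)
    return seen


# Shortened, illustrative sample; not the full output from this post.
SAMPLE = '''export "CRYOSPARC_CUDA_PATH=/usr/local/cuda"
export "PATH=/usr/local/cuda/bin:/usr/bin:/usr/local/cuda/bin:/bin"
'''

env = parse_exports(SAMPLE)
print(env["CRYOSPARC_CUDA_PATH"])   # /usr/local/cuda
print(unique_entries(env["PATH"]))  # ['/usr/local/cuda/bin', '/usr/bin', '/bin']
```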

wtempel · June 1, 2022, 3:24pm

No output is expected when that command succeeds; it should load the cryoSPARC worker environment into the current shell.
I am surprised about the finding of CUDA release 10.1 in the nvcc output.

Without any changes to the CUDA configuration, please can you run

eval $(/home/k306/cryosparc_user/cryosparc/cryosparc_worker/bin/cryosparcw env)
python -c "import pycuda.driver; print(pycuda.driver.get_version())"

and report the output of the second command.

errinaceus · June 2, 2022, 6:52am

The output is:
(11, 7, 0)
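
pycuda.driver.get_version() returns the CUDA version that PyCUDA was compiled against as a (major, minor, revision) tuple, so it can be compared with the release reported by nvcc. Here is a minimal sketch of that comparison, using values reported in this thread. The function name is hypothetical, and comparing only major and minor is an assumption about what counts as a match.

```python
def versions_match(pycuda_version, nvcc_release):
    """Compare only (major, minor); the revision number is ignored."""
    return tuple(pycuda_version[:2]) == tuple(nvcc_release[:2])


# PyCUDA built against 11.7 while nvcc reports release 10.1.
print(versions_match((11, 7, 0), (10, 1)))  # False
# PyCUDA built against 11.2 and nvcc reporting release 11.2.
print(versions_match((11, 2, 0), (11, 2)))  # True
```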

wtempel · June 2, 2022, 2:12pm

This indicates a version mismatch with the release 10.1 reported by nvcc earlier.

I noticed that you installed CUDA 11.2 alongside 11.7.

I do not know the path of that CUDA-11.2 installation, but let’s assume it is /usr/local/cuda-11.2. In that case, I would suggest (substitute actual path to cuda-11.2 on your computer for steps 1 and 2):

1. Edit the line in cryosparc_worker/config.sh that begins with
   export CRYOSPARC_CUDA_PATH=
   so that it becomes
   export CRYOSPARC_CUDA_PATH=/usr/local/cuda-11.2
2. “Register” cuda-11.2 with the cryoSPARC worker (one-line command):
   /home/k306/cryosparcuser/cryosparc/cryosparc_worker/bin/cryosparcw newcuda /usr/local/cuda-11.2
3. Attempt to run a clone of the Patch Motion Correction job.
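
The config.sh edit can be sketched as a small text transformation. This is only an illustration under stated assumptions: the function name set_cuda_path is hypothetical, the example config contents are made up and shortened, and /usr/local/cuda-11.2 is the assumed path from this post. It does not replace the cryosparcw newcuda step, which still has to be run afterwards.

```python
PREFIX = "export CRYOSPARC_CUDA_PATH="


def set_cuda_path(config_text, new_path):
    """Return config_text with the CRYOSPARC_CUDA_PATH line pointing at new_path."""
    lines = config_text.splitlines()
    found = False
    for i, line in enumerate(lines):
        if line.startswith(PREFIX):
            lines[i] = PREFIX + new_path
            found = True
    if not found:
        raise ValueError("no CRYOSPARC_CUDA_PATH line found")
    return "\n".join(lines) + "\n"


# Hypothetical, shortened config.sh contents for illustration only.
example = 'export CRYOSPARC_USE_GPU=true\nexport CRYOSPARC_CUDA_PATH="/usr/local/cuda"\n'
print(set_cuda_path(example, "/usr/local/cuda-11.2"), end="")
```

Raising an error when the line is missing, rather than appending a new one, is a deliberate choice here so that an unexpected config.sh layout is noticed instead of silently modified.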

errinaceus · June 6, 2022, 8:06am

I ran all these commands and the pycuda version is now (11, 2, 0), but motion correction still fails with the same errors.

wtempel · June 6, 2022, 3:11pm

At this point, I suggest the following:

1. Please install the latest patch (patching documentation):
   cryosparcm patch
   The patch is not expected to resolve the issue, but it should produce more meaningful error messages.
2. Capture the CUDA configuration again (two lines of commands):

   eval $(/home/k306/cryosparcuser/cryosparc/cryosparc_worker/bin/cryosparcw env)
   (date && $CRYOSPARC_CUDA_PATH/bin/nvcc --version && python -c "import pycuda.driver; print(pycuda.driver.get_version())") | tee /tmp/cuda_info.txt

3. Run a clone of the failed Patch_Motion_Correction job.
4. Paste the error messages from the Overview tab of that (presumably failing) job into a response under this forum topic.

I will send you a direct message about where you may send /tmp/cuda_info.txt as well as (from the new job’s directory) job.json and job.log.

errinaceus · June 7, 2022, 6:54am

Thanks, the patch seemingly resolved the issue! The job completed. However, I still got an error message, so here is the run log anyway:
…[CPU: 170.6 MB] Calling CUDA init from 21362

[CPU: 246.2 MB] Child process with PID 21362 terminated unexpectedly with exit code 1.

[CPU: 246.2 MB] --------------------------------------------------------------

[CPU: 246.2 MB] Compiling job outputs…

[CPU: 246.2 MB] Passing through outputs for output group micrographs from input group movies

[CPU: 246.2 MB] This job outputted results ['micrograph_blob_non_dw', 'micrograph_thumbnail_blob_1x', 'micrograph_thumbnail_blob_2x', 'micrograph_blob', 'background_blob', 'rigid_motion', 'spline_motion']

[CPU: 246.2 MB] Loaded output dset with 0 items

[CPU: 246.2 MB] Passthrough results ['movie_blob', 'gain_ref_blob', 'mscope_params']

[CPU: 246.2 MB] Loaded passthrough dset with 20 items

[CPU: 246.2 MB] Intersection of output and passthrough has 0 items

[CPU: 246.2 MB] Passing through outputs for output group micrographs_incomplete from input group movies

[CPU: 246.2 MB] This job outputted results ['micrograph_blob']

[CPU: 246.2 MB] Loaded output dset with 20 items

[CPU: 246.2 MB] Passthrough results ['movie_blob', 'gain_ref_blob', 'mscope_params']

[CPU: 246.2 MB] Loaded passthrough dset with 20 items

[CPU: 246.8 MB] Intersection of output and passthrough has 20 items

[CPU: 246.8 MB] Checking outputs for output group micrographs

[CPU: 246.8 MB] Checking outputs for output group micrographs_incomplete

[CPU: 246.8 MB] Updating job size…

[CPU: 246.8 MB] Exporting job and creating csg files…

[CPU: 246.9 MB] ***************************************************************
[CPU: 246.9 MB] Job complete. Total time 30.43s

Current CUDA configuration is:
Tue 07 Jun 2022 09:40:42 CET
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Nov_30_19:08:53_PST_2020
Cuda compilation tools, release 11.2, V11.2.67
Build cuda_11.2.r11.2/compiler.29373293_0

errinaceus · June 8, 2022, 6:47am

Actually, no images were processed even though the job completed. Something is still wrong with CUDA.
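
The run log already hints at this: the micrographs group was loaded with 0 items while micrographs_incomplete holds all 20. A minimal sketch for reading those counts out of the log text follows; the function name is hypothetical and the two patterns are taken from the log lines quoted in this thread, so they may not match other cryoSPARC versions.

```python
import re

GROUP_RE = re.compile(r"Passing through outputs for output group (\S+) from input group")
COUNT_RE = re.compile(r"Loaded output dset with (\d+) items")


def output_counts(log_text):
    """Map each output group name to the item count logged right after it."""
    counts = {}
    current_group = None
    for line in log_text.splitlines():
        group = GROUP_RE.search(line)
        if group:
            current_group = group.group(1)
            continue
        count = COUNT_RE.search(line)
        if count and current_group is not None:
            counts[current_group] = int(count.group(1))
            current_group = None
    return counts


# Lines copied from the run log quoted in this thread.
LOG = """\
[CPU: 246.2 MB] Passing through outputs for output group micrographs from input group movies
[CPU: 246.2 MB] Loaded output dset with 0 items
[CPU: 246.2 MB] Loaded passthrough dset with 20 items
[CPU: 246.2 MB] Passing through outputs for output group micrographs_incomplete from input group movies
[CPU: 246.2 MB] Loaded output dset with 20 items
"""

print(output_counts(LOG))  # {'micrographs': 0, 'micrographs_incomplete': 20}
```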

wtempel · June 8, 2022, 4:10pm

The job.log file you sent us includes a 'CUDA driver library not found' error. Please check whether the scenario (and its resolution) described by another user under this related topic applies in your case.

errinaceus · June 9, 2022, 7:05am

Thanks, the problem was indeed with this library! In my case I found libcuda.so in …/cuda-11.2/lib64/stubs/ and simply copied it into the parent directory, …/cuda-11.2/lib64/.

wtempel · June 9, 2022, 11:01am

Copying the library from stubs/ is discouraged according to this source. Have you confirmed that patch motion correction and follow-up processing steps now run correctly?
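
One way to notice that a libcuda.so resolves into a stubs directory is to follow symlinks and inspect the resulting path. This is a minimal sketch with a hypothetical function name. It only looks at the resolved path, so it would flag a soft link into stubs/ but cannot detect a stub file that has been copied into another directory.

```python
import os


def is_stub_library(path):
    """True if path, after resolving symlinks, lies inside a directory named 'stubs'."""
    real = os.path.realpath(path)
    directories = os.path.dirname(real).split(os.sep)
    return "stubs" in directories


if __name__ == "__main__":
    # Assumed install prefix from this thread; adjust to the actual location.
    candidate = "/usr/local/cuda-11.2/lib64/stubs/libcuda.so"
    print(candidate, "->", is_stub_library(candidate))
```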

errinaceus · June 9, 2022, 11:24am

I have only tried patch motion correction, and it seems to run correctly. I also have another libcuda.so at …/cuda-11.2/targets/x86_64-linux/lib/stubs/libcuda.so, like the one described in the related topic. Is the right way perhaps to create a soft link to this file?

wtempel · December 15, 2022, 7:39pm

I am not sure. Some users have reported success with this strategy, but another source appears to discourage it.
My recommendation (as of December 2022, my views are still evolving):

1. Do not copy or otherwise link to files inside the stubs directory.
2. Perform installation of the CUDA toolkit and other CUDA-related installation steps on a worker with the NVIDIA drivers installed. This should ensure the availability of a non-stub libcuda.so. On Ubuntu 22.04 (updated), for example, /usr/lib/x86_64-linux-gnu/libcuda.so is included in the libnvidia-compute package, on which the nvidia-driver package depends. If the software installation is to be shared between workers that have different versions of the NVIDIA driver installed, I would try to perform the installation steps on the worker with the oldest driver version, but I have not thoroughly tested this scenario.
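
To see which libcuda.so files the dynamic linker knows about, the output of ldconfig -p can be filtered. This is a minimal sketch: the function name is hypothetical, and SAMPLE is made-up text that only imitates the usual `name (arch) => path` layout of ldconfig -p, not output from the machine in this thread.

```python
import re

# Matches cache lines such as: libcuda.so.1 (libc6,x86-64) => /path/libcuda.so.1
LINE_RE = re.compile(r"^\s*(libcuda\.so\S*)\s+\(([^)]*)\)\s+=>\s+(\S+)$")


def libcuda_entries(ldconfig_output):
    """Return (library name, path) pairs for libcuda entries in ldconfig -p output."""
    entries = []
    for line in ldconfig_output.splitlines():
        match = LINE_RE.match(line)
        if match:
            entries.append((match.group(1), match.group(3)))
    return entries


# Hypothetical sample imitating `ldconfig -p` output.
SAMPLE = (
    "\tlibcudart.so.11.0 (libc6,x86-64) => /usr/local/cuda-11.2/lib64/libcudart.so.11.0\n"
    "\tlibcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1\n"
    "\tlibcuda.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so\n"
)

for name, path in libcuda_entries(SAMPLE):
    print(name, "->", path)
```

On a real worker, the input could come from `subprocess.run(["ldconfig", "-p"], capture_output=True, text=True).stdout`; note that ldconfig often lives in /sbin, which may not be on every user's PATH.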