Topaz train uses only CPU

Hi, I ran Topaz on a cluster (each node has 2 GPUs), but it seems to use only the CPU (see below) and takes a very long time to finish. I set “Expected number of particles” to 100 and kept the other options at their defaults. Could anyone help me figure this out? Thank you!

[CPU: 162.9 MB] THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1544176307774/work/torch/csrc/cuda/Module.cpp line=34 error=30 : unknown error

[CPU: 162.9 MB] # Loading model: resnet8

[CPU: 162.9 MB] # Model parameters: units=32, dropout=0.0, bn=on

[CPU: 162.9 MB] # Receptive field: 71

[CPU: 162.9 MB] CudaWarning: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch_1544176307774/work/torch/csrc/cuda/Module.cpp:34

[CPU: 162.9 MB] Falling back to CPU.

[CPU: 162.9 MB] # Using device=0 with cuda=False

[CPU: 162.9 MB] # Loaded 32 training micrographs with 2660 labeled particles

[CPU: 162.9 MB] # Loaded 7 test micrographs with 584 labeled particles

[CPU: 162.9 MB] # source split p_observed num_positive_regions total_regions

[CPU: 162.9 MB] # 0 train 0.000409 77140 188559360

[CPU: 162.9 MB] # 0 test 0.000411 16936 41247360

[CPU: 162.9 MB] # Specified expected number of particle per micrograph = 100.0

[CPU: 162.9 MB] # With radius = 3

[CPU: 162.9 MB] # Setting pi = 0.0004921527098946454

[CPU: 162.9 MB] # minibatch_size=128, epoch_size=5000, num_epochs=10

[CPU: 162.9 MB] # Done!

[CPU: 162.9 MB] Training command complete.

Hi @jianhaoc,
Which cryoSPARC version are you running?

The version shown next to the “Change Log” title is “Current version: v2.15.1-live_privatebeta”.

The error message

CudaWarning: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch_1544176307774/work/torch/csrc/cuda/Module.cpp:34

indicates that something is wrong with the NVIDIA driver on the cluster node. Can you log into the cluster node (as the same user that runs cryoSPARC) and run nvidia-smi?
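
If it is easier to script, here is a minimal sketch of the same check in Python. It assumes nvidia-smi is on the PATH of the user that runs cryoSPARC; the function name query_gpus is only an illustration and is not part of cryoSPARC or Topaz.

```python
import shutil
import subprocess


def query_gpus():
    """Return one 'name, driver_version' string per GPU,
    or None if nvidia-smi is missing or exits with an error."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    try:
        result = subprocess.run(
            [exe, "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = query_gpus()
    if gpus is None:
        print("nvidia-smi is missing or failed; check the NVIDIA driver on this node")
    else:
        for index, description in enumerate(gpus):
            print(f"GPU {index}: {description}")
```

If this reports a failure on a node that does have GPUs, the driver on that node is the first thing to look at.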

Hi @alexjamesnoble @apunjani
I get the same error with CS 3.2 and Topaz 0.2.3 (installed using conda install topaz=0.2.3 cudatoolkit=10.1 -c tbepler -c pytorch -c conda-forge).
I checked the installation and the CUDA binding (as described here: Start Locally | PyTorch). It looks OK to me.

[user@gpu-01 ~]$ python3
Python 3.7.4 (default, Mar 31 2020, 10:25:18) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> x = torch.rand(5, 3)
>>> print(x)
tensor([[0.5759, 0.5849, 0.7097],
        [0.8814, 0.0329, 0.0948],
        [0.8039, 0.6324, 0.7361],
        [0.7562, 0.1569, 0.1055],
        [0.0525, 0.8304, 0.4985]])
>>> 
>>> import torch
>>> torch.cuda.is_available()
True

Please advise.
Thanks

Please can you post the precise error message (including any file path(s) mentioned)?
Can you also describe how you made cryoSPARC find and run Topaz? Did you use a wrapper script?

Your conda install command suggests that you installed (and presumably intend to use) topaz in a virtual environment. Does the python3 you ran above correspond to that same virtual environment? You can check with:
which python3

I uninstalled Topaz from our gpu01 server and installed the newer version 0.2.4, with an updated CS, on a new server (gpu02), but the error remains the same. Here are the answers to your questions:

I’m using the wrapper script described in https://guide.cryosparc.com/processing-data/all-job-types-in-cryosparc/deep-picking/topaz.
It works nicely except for GPU integration.
Here’s my python environment:

[user@gpu-02 ~]$ which python
/usr/local/anaconda3/bin/python
[user@gpu-02 ~]$ /usr/local/anaconda3/bin/python
Python 3.9.7 (default, Sep 16 2021, 13:09:58) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

and a simple python3 gives this:

[user@gpu-02 ~]$ python3
Python 3.9.7 (default, Sep 16 2021, 13:09:58) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

I didn’t get any error message during installation. All I’m getting is topaz saying the following when I try to run any topaz function on CS:

[CPU: 234.8 MB]  Starting Topaz process using version 0.2.4...

[CPU: 237.7 MB]  Training new model using training movies.

[CPU: 237.7 MB]  Preparing training data...

[CPU: 237.7 MB]  Preparing training data complete in 147038.483s.

[CPU: 108.6 MB]  
Beginning Topaz denoising command by running command /home/user/topaz.sh denoise [MICROGRAPH PATHS EXCLUDED FOR LEGIBILITY] --device 0 --format mrc --normalize --patch-size 1536 --patch-padding 256 --output /media/nfsraid02/Processed/P1/J132/denoised_micrographs --lowpass 1 --gaussian 0 --inv-gaussian 0 --deconv-patch 1 --pixel-cutoff 0 --save-prefix /media/nfsraid02/Processed/P1/J132/model/denoise_model --dir-a /media/nfsraid02/Processed/P1/J132/A_train_dset --dir-b /media/nfsraid02/Processed/P1/J132/B_train_dset --lr 0.001 --batch-size 1 --num-epochs 10 --criteria L2 --crop 800 --num-workers 16 --arch unet --method noise2noise --optim adagrad


[CPU: 110.7 MB]  CudaWarning: module 'torch._C' has no attribute '_cuda_setDevice'

[CPU: 110.7 MB]  Falling back to CPU.

[CPU: 110.7 MB]  # using device=0 with cuda=False

[CPU: 110.7 MB]  # training with 9465 image pairs

[CPU: 110.7 MB]  # validating on 1051 image pairs

Thanks

May I next ask:
Does topaz work with GPU support on this server outside cryoSPARC?
Does /home/user/topaz.sh match the current version of the script verbatim or have there been any adjustments?
Was topaz installed in, and is it run from, an activated conda environment?
Could there be any interference from additional pytorch installations on the system?
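
To answer the first and last of these questions in one go, something like the following could be run with the Python of the activated topaz environment, outside cryoSPARC. This is only a sketch: torch_cuda_report is a made-up helper name, and it relies on standard PyTorch attributes (torch.__file__, torch.version.cuda, torch.cuda.is_available()). It also runs a small operation on GPU 0, because is_available() returning True does not by itself prove that work can actually be launched on the device.

```python
import importlib.util


def torch_cuda_report():
    """Collect basic facts about the PyTorch that this interpreter imports."""
    report = {"torch_found": False}
    if importlib.util.find_spec("torch") is None:
        return report  # no PyTorch visible to this Python at all
    try:
        import torch
    except Exception as err:  # a broken install can fail at import time
        report["import_error"] = repr(err)
        return report

    report["torch_found"] = True
    report["torch_file"] = torch.__file__          # reveals which installation is used
    report["torch_version"] = torch.__version__
    report["cuda_build"] = torch.version.cuda      # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()

    if report["cuda_available"]:
        try:
            # Small real operation on device 0 (the device Topaz was asked to use).
            x = torch.rand(5, 3, device="cuda:0")
            report["gpu_op_ok"] = bool((x + x).sum().item() >= 0)
        except Exception as err:
            report["gpu_op_ok"] = False
            report["gpu_error"] = repr(err)
    return report


if __name__ == "__main__":
    for key, value in torch_cuda_report().items():
        print(f"{key}: {value}")
```

If torch_file points outside the topaz environment, or cuda_build is None, that would suggest a second or CPU-only PyTorch is being picked up.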

I haven’t tried running Topaz outside CS. My guess is that the problem is general rather than CS-specific.
Here’s the script I’m using:

#!/usr/bin/env bash
if command -v conda > /dev/null 2>&1; then
    conda deactivate > /dev/null 2>&1 || true  # ignore any errors
    conda deactivate > /dev/null 2>&1 || true  # ignore any errors
fi
unset _CE_CONDA
unset CONDA_DEFAULT_ENV
unset CONDA_EXE
unset CONDA_PREFIX
unset CONDA_PROMPT_MODIFIER
unset CONDA_PYTHON_EXE
unset CONDA_SHLVL
unset PYTHONPATH
unset LD_PRELOAD
unset LD_LIBRARY_PATH

source /usr/local/anaconda3/conda.sh
conda activate topaz
exec topaz $@

Yes, Topaz was installed after activating the conda environment.
PyTorch was installed as described in the installation guide. I’m not sure how to change the PyTorch version, or whether that is what is causing the issue.

This issue persists in CS4.1 too.
Still not sure how to resolve it.

Thanks

Please can you check whether your Topaz installation can perform training with GPU support when you run Topaz directly, outside CryoSPARC.

Is there any update on this issue? I am having the same problem. It seems that there is some incompatibility between certain versions of cudatoolkit and PyTorch.

Please can you confirm that the topaz installation that you wish to use inside CryoSPARC functions outside CryoSPARC using GPUs.

Topaz works well in CS 4.1. Just use cudatoolkit 11.7 and the nvidia channel (-c nvidia) instead of conda-forge. This resolved the Topaz integration problem in cryoSPARC. Basically, there is a mismatch between the PyTorch and CUDA versions. The above is for Topaz 0.2.5. If it works for everyone, maybe the CS team can update the command on the cryoSPARC Topaz page.
Happy structure solving!

I had the same problem, and it turned out that conda had somehow installed the CPU-only version of PyTorch.
Use the command:
conda list | grep "torch"
to check your PyTorch build. If the pytorch entry shows cpu_0 at the end of its build string, it is the CPU-only version.
You will then need to reinstall Topaz using the following command:
conda install topaz pytorch torchvision torchaudio cudatoolkit=11.1 -c tbepler -c pytorch -c conda-forge -c nvidia
Including “torchaudio” turned out to be the key to getting a CUDA-enabled PyTorch. I don’t know why, but it worked for me.
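
For anyone who wants to automate that check, here is a rough Python sketch that scans the text printed by conda list for torch-related packages and flags CPU-only builds. The helper names and the SAMPLE text are made up for illustration (they are not output from this thread), and the rule "build string contains cpu and not cuda" is an assumption based on how the build strings looked in my case.

```python
def find_torch_builds(conda_list_text):
    """Map each torch-related package name to its build string.

    Expects the usual `conda list` columns: name, version, build, channel.
    """
    builds = {}
    for line in conda_list_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header comments
        fields = line.split()
        if len(fields) >= 3 and "torch" in fields[0]:
            builds[fields[0]] = fields[2]
    return builds


def looks_cpu_only(build):
    """Heuristic (assumption): CPU-only builds mention 'cpu' and not 'cuda'."""
    lowered = build.lower()
    return "cpu" in lowered and "cuda" not in lowered


# Made-up sample in the shape of `conda list` output, for illustration only.
SAMPLE = """\
# Name                    Version                   Build  Channel
pytorch                   1.8.1               py3.9_cpu_0    pytorch
torchvision               0.9.1                  py39_cpu    pytorch
"""

if __name__ == "__main__":
    for name, build in find_torch_builds(SAMPLE).items():
        status = "CPU-only" if looks_cpu_only(build) else "not flagged"
        print(f"{name}: {build} ({status})")
```

To run it against a real environment, the text could come from subprocess.run(["conda", "list"], capture_output=True, text=True).stdout inside the activated topaz environment.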
