Topaz train uses only CPU

Hi, I ran Topaz on a cluster (each node has 2 GPUs), but it seems to use only the CPU (see the log below), and the job takes a very long time to finish. I set “Expected number of particles” to 100 and left the other options at their defaults. Could anyone help me figure this out? Thank you!

[CPU: 162.9 MB] THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1544176307774/work/torch/csrc/cuda/Module.cpp line=34 error=30 : unknown error

[CPU: 162.9 MB] # Loading model: resnet8

[CPU: 162.9 MB] # Model parameters: units=32, dropout=0.0, bn=on

[CPU: 162.9 MB] # Receptive field: 71

[CPU: 162.9 MB] CudaWarning: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch_1544176307774/work/torch/csrc/cuda/Module.cpp:34

[CPU: 162.9 MB] Falling back to CPU.

[CPU: 162.9 MB] # Using device=0 with cuda=False

[CPU: 162.9 MB] # Loaded 32 training micrographs with 2660 labeled particles

[CPU: 162.9 MB] # Loaded 7 test micrographs with 584 labeled particles

[CPU: 162.9 MB] # source split p_observed num_positive_regions total_regions

[CPU: 162.9 MB] # 0 train 0.000409 77140 188559360

[CPU: 162.9 MB] # 0 test 0.000411 16936 41247360

[CPU: 162.9 MB] # Specified expected number of particle per micrograph = 100.0

[CPU: 162.9 MB] # With radius = 3

[CPU: 162.9 MB] # Setting pi = 0.0004921527098946454

[CPU: 162.9 MB] # minibatch_size=128, epoch_size=5000, num_epochs=10

[CPU: 162.9 MB] # Done!

[CPU: 162.9 MB] Training command complete.

Hi @jianhaoc,
Which cryoSPARC version are you running?

The version shown next to the “Change Log” title is “Current version: v2.15.1-live_privatebeta”.

The error message

CudaWarning: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch_1544176307774/work/torch/csrc/cuda/Module.cpp:34

indicates that something is wrong with the NVIDIA driver on the cluster node. Can you log into the cluster node (as the same user that runs cryoSPARC) and run nvidia-smi?
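
If it is easier to run this check from a script (for example inside a cluster job submitted as the cryoSPARC user), a minimal Python sketch along these lines could capture the nvidia-smi output. The helper name check_nvidia_smi is made up for illustration and is not part of cryoSPARC or Topaz; it uses only the standard library and assumes nvidia-smi is on the PATH of that user.

```python
import shutil
import subprocess


def check_nvidia_smi(timeout=30):
    """Run nvidia-smi and return (ok, output).

    Hypothetical helper for illustration only.
    ok is True when nvidia-smi exits with status 0.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # The binary is missing or not on PATH for this user.
        return False, "nvidia-smi not found on PATH"
    try:
        result = subprocess.run(
            [exe],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, "nvidia-smi could not be run: {}".format(exc)
    return result.returncode == 0, result.stdout


if __name__ == "__main__":
    ok, output = check_nvidia_smi()
    print("nvidia-smi succeeded" if ok else "nvidia-smi failed")
    print(output)
```

Running it on the same node and as the same user as the failing job matters, since a driver problem can be specific to one node.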

Hi @alexjamesnoble @apunjani
I see the same error with cryoSPARC v3.2 and Topaz 0.2.3 (installed using conda install topaz=0.2.3 cudatoolkit=10.1 -c tbepler -c pytorch -c conda-forge).
I checked the installation and the CUDA bindings as described on the PyTorch “Start Locally” page, and they look OK to me.

[user@gpu-01 ~]$ python3
Python 3.7.4 (default, Mar 31 2020, 10:25:18) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> x = torch.rand(5, 3)
>>> print(x)
tensor([[0.5759, 0.5849, 0.7097],
        [0.8814, 0.0329, 0.0948],
        [0.8039, 0.6324, 0.7361],
        [0.7562, 0.1569, 0.1055],
        [0.0525, 0.8304, 0.4985]])
>>> 
>>> import torch
>>> torch.cuda.is_available()
True

Please advise.
Thanks
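
A note on the check above: torch.cuda.is_available() returning True does not by itself show that a tensor can actually be placed on the GPU, and the python3 on the node may not be the interpreter from the conda environment that Topaz uses. A slightly stronger test, sketched below, is to allocate a tensor on the GPU and run a small operation, using the Python from the Topaz environment. The function name cuda_smoke_test and the report keys are hypothetical, chosen only for this example.

```python
import sys


def cuda_smoke_test():
    """Try a tiny computation on the GPU and return a report dict.

    Hypothetical helper for illustration only.
    """
    report = {
        "python": sys.executable,  # which interpreter is really in use
        "torch": None,             # installed PyTorch version, if any
        "cuda_build": None,        # CUDA version PyTorch was built against
        "cuda_ok": False,          # True only if a GPU operation succeeded
        "detail": "",
    }
    try:
        import torch
    except ImportError as exc:
        report["detail"] = "torch not importable: {}".format(exc)
        return report

    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda

    if not torch.cuda.is_available():
        report["detail"] = "torch.cuda.is_available() is False"
        return report

    try:
        # Allocate on the GPU and force a real computation.
        x = torch.rand(5, 3, device="cuda")
        (x @ x.t()).sum().item()
        report["cuda_ok"] = True
        report["detail"] = "ran on {}".format(torch.cuda.get_device_name(0))
    except RuntimeError as exc:
        # CUDA runtime failures are typically raised as RuntimeError.
        report["detail"] = "CUDA call failed: {}".format(exc)
    return report


if __name__ == "__main__":
    for key, value in cuda_smoke_test().items():
        print("{}: {}".format(key, value))
```

Comparing the "python" and "torch" lines with the environment that the cryoSPARC job launches can show whether the interactive test and the job are using the same installation.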