TOPAZ not using GPU

Hello,

I am using TOPAZ Extract on our cluster setup. When I leave the defaults I get the torch._C error shown in the image below. When I log into the node I see that no GPU is being used.

So instead I tried to run on CPU only, but that does not seem to be working either. If I request e.g. 6 CPUs and then log into the node, I see only 1 CPU being used. Agh.

Any help appreciated.

image

@orangeboomerang Please can you post the text of your error message (so users with a similar problem will find this topic in the future) and let us know whether:

  1. other GPU job types run on this worker as expected
  2. you installed topaz in its own conda environment
  3. you are using a wrapper script for CryoSPARC-embedded topaz jobs
  4. topaz jobs give you the same message when you execute them via the command line, using the same Path to Topaz executable that you specified for CryoSPARC-embedded Topaz Extract
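For item 4, a quick command-line check might look like the following (a sketch only; TOPAZ_EXEC is a placeholder for the exact path you configured in CryoSPARC, and the python probe assumes the topaz environment is active):

```shell
# Sketch: invoke the same wrapper CryoSPARC calls, outside CryoSPARC.
TOPAZ_EXEC=/path/to/topaz.sh    # placeholder: your "Path to Topaz executable"
"$TOPAZ_EXEC" extract --help    # confirms the wrapper resolves and topaz runs
# Then check whether the environment's PyTorch can see a GPU at all:
python -c 'import torch; print(torch.__version__, torch.cuda.is_available())'
```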

Hi,

Here is the full output from the GPU job. This should clarify things and answer your questions.

To answer your final question: yes, this command WILL run in standalone TOPAZ, suggesting that something is not right with the submission script I am using through CryoSPARC.

Thank you.


====================== Cluster submission script: ========================
==========================================================================
#!/bin/bash
#### cryoSPARC cluster submission script template for SLURM
## /cryosparc_v4.0.0_gpu99Master_gpu118Worker_PORT51000_LIC_2dfef1c2/cryosparc_worker/bin/cryosparcw run --project P9 --job J387 --master_hostname gpu99.institute.local --master_command_core_port 51002 > /nfs/science/group/cryosparc_v3.3.2_gpu114Master_gpu118Worker_LIC_2dfef1c2_CUDA_11.0.3/(path)/J387/job.log 2>&1             - the complete command string to run the job
## 8            - the number of CPUs needed
## 1            - the number of GPUs needed.
##                            Note: the code will use this many GPUs starting from dev id 0
##                                  the cluster scheduler or this script have the responsibility
##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up
##                                  using the correct cluster-allocated GPUs.
## 8.0             - the amount of RAM needed in GB
## /nfs/science/group/cryosparc_v4.0.0_gpu99Master_gpu118Worker_PORT51000_LIC_2dfef1c2/cryosparc_worker/bin/cryosparcw    - absolute path to the cryosparc worker command
## --project P9 --job J387 --master_hostname gpu99.institute.local --master_command_core_port 51002           - arguments to be passed to cryosparcw run
## P9        - uid of the project
## J387            - uid of the job
##
## What follows is a simple SLURM script:


#SBATCH --job-name cs_1_P9_J387
#SBATCH -n 8
#SBATCH --gres=gpu:1
#####SBATCH --mem=128000MB
#SBATCH --mem-per-cpu=11G
#SBATCH -o /nfs/science/group/cryosparc_slurm_outputs/output_P9_J387.txt
#SBATCH -e /nfs/science/group/cryosparc_slurm_outputs/error_P9_J387.txt
#Define the "gpu" partition for GPU-accelerated jobs
#SBATCH --partition=gpu

#Define the GPU architecture (GTX980 in the example, other options are GTX1080Ti, K40)
######SBATCH --constraint=GTX1080Ti
#SBATCH --exclude=gpu227,gpu228,gpu138,gpu150,gpu148
######SBATCH --constraint=buster
#SBATCH --time=96:00:00

module load cuda/11.2.2
module load tensorflow

nvidia-smi

available_devs=""
for devidx in $(seq 0 15);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
#export CUDA_VISIBLE_DEVICES=$available_devs
echo $available_devs
echo $CUDA_HOME
echo "$(hostname)"
echo $SLURM_TMPDIR

/usr/bin/nvidia-smi

module list

export CRYOSPARC_SSD_PATH="${SLURM_TMPDIR}"

/nfs/science/group/cryosparc_v4.0.0_gpu99Master_gpu118Worker_PORT51000_LIC_2dfef1c2/cryosparc_worker/bin/cryosparcw run --project P9 --job J387 --master_hostname gpu99.institute.local --master_command_core_port 51002 > /nfs/(path).job.log 2>&1 

==========================================================================
==========================================================================

-------- Submission command: 
sbatch /(path)/queue_sub_script.sh


-------- Queued on cluster at XXXXX

Job J387 Started
[CPU: 96.4 MB]

Master running v4.0.0, worker running v4.0.0
[CPU: 96.7 MB]

Working in directory: (path)
[CPU: 96.7 MB]

Running on lane slurmcluster
[CPU: 96.7 MB]

Resources allocated: 
[CPU: 96.7 MB]

  Worker:  slurmcluster
[CPU: 96.7 MB]

  CPU   :  [0, 1, 2, 3, 4, 5, 6, 7]
[CPU: 96.7 MB]

  GPU   :  [0]
[CPU: 96.7 MB]

  RAM   :  [0]
[CPU: 96.7 MB]

  SSD   :  False
[CPU: 96.7 MB]

--------------------------------------------------------------
[CPU: 96.7 MB]

Importing job module for job type topaz_extract...
[CPU: 243.0 MB]

Job ready to run
[CPU: 243.0 MB]

***************************************************************
[CPU: 243.0 MB]

Topaz is a particle detection tool created by Tristan Bepler and Alex J. Noble.
Citations:
- Bepler, T., Morin, A., Rapp, M. et al. Positive-unlabeled convolutional neural networks for particle picking in cryo-electron micrographs. Nat Methods 16, 1153-1160 (2019) doi:10.1038/s41592-019-0575-8
- Bepler, T., Noble, A.J., Berger, B. Topaz-Denoise: general deep denoising models for cryoEM. bioRxiv 838920 (2019) doi: https://doi.org/10.1101/838920

Structura Biotechnology Inc. and cryoSPARC do not license Topaz nor distribute Topaz binaries. Please ensure you have your own copy of Topaz licensed and installed under the terms of its GNU General Public License v3.0, available for review at: https://github.com/tbepler/topaz/blob/master/LICENSE.
***************************************************************

[CPU: 246.0 MB]

Starting Topaz process using version 0.2.4...
[CPU: 246.0 MB]

Skipping preprocessing.
[CPU: 246.0 MB]

Using preprocessed micrographs from  J183/preprocessed
[CPU: 246.2 MB]

--------------------------------------------------------------
[CPU: 246.2 MB]

Inverting negative staining...
[CPU: 246.2 MB]

Inverting negative staining complete.

[CPU: 246.2 MB]

--------------------------------------------------------------
[CPU: 246.2 MB]

Starting extraction...

[CPU: 246.2 MB]

Starting extraction by running command (path)/topaz.sh extract --radius 7 --threshold -6 --up-scale 4 --assignment-radius -1 --min-radius 5 --max-radius 100 --step-radius 5 --num-workers 8 --device 0 --model (path) -o (path) [MICROGRAPH PATHS EXCLUDED FOR LEGIBILITY]

[CPU: 246.2 MB]

Please type
[CPU: 246.2 MB]

source
[CPU: 246.2 MB]

/(path)/anaconda3/2022.05/activate_anaconda3_2022.05.txt
[CPU: 246.2 MB]

**CudaWarning: module 'torch._C' has no attribute '_cuda_setDevice'**
**[CPU: 246.2 MB]**

**Falling back to CPU.**

To add a bit more, here is some of the output from the job info on the cluster

 ...
...
   NodeList=gpu125
   BatchHost=gpu125
   NumNodes=1 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=88G,node=1,billing=8,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   JOB_GRES=gpu:1
     Nodes=gpu125 CPU_IDs=0-7 Mem=90112 GRES=gpu:1(IDX:0)
   MinCPUsNode=1 MinMemoryCPU=11G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
  ...
...
   TresPerNode=gpu:1
   NtasksPerTRES:0

When I SSH into gpu125 and check “nvidia-smi”, it reports “no running processes found”.

Something may be wrong with (path)/topaz.sh. Would you like to post the script?
Also: Do CryoSPARC GPU-enabled jobs, other than jobs that wrap TOPAZ commands, run as expected on your cluster?

hi,

Sorry for the late reply. Yes we are still having this error and apparently our IT folks are not sure what the issue is, so any help would be much appreciated!

Below is topaz.sh, and yes, other GPU jobs run without issue.

#!/usr/bin/env bash

if command -v conda > /dev/null 2>&1; then
    conda deactivate > /dev/null 2>&1 || true # ignore any errors
    conda deactivate > /dev/null 2>&1 || true # ignore any errors
fi

unset _CE_CONDA
unset CONDA_DEFAULT_ENV
unset CONDA_EXE
unset CONDA_PREFIX
unset CONDA_PROMPT_MODIFIER
unset CONDA_PYTHON_EXE
unset CONDA_SHLVL
unset PYTHONPATH
unset LD_PRELOAD
unset LD_LIBRARY_PATH
module load anaconda3/2022.05
source /(path)/anaconda3/2022.05/etc/profile.d/conda.sh
conda activate topaz
exec topaz "$@"  # quote "$@" so arguments containing spaces are passed intact

I am unsure about the effect of the "module load anaconda3/2022.05" line in topaz.sh. Can it be omitted?

Did you confirm that in “standalone” mode, topaz commands using this specific topaz conda environment run with GPU support?
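A minimal standalone check might look like the following (a sketch; the conda.sh path is copied from your topaz.sh and may need adjusting for your site):

```shell
# Sketch: activate the same environment topaz.sh activates, then ask
# PyTorch whether it can actually reach a GPU on this node.
source /(path)/anaconda3/2022.05/etc/profile.d/conda.sh
conda activate topaz
python -c 'import torch; print(torch.cuda.is_available(), torch.version.cuda)'
nvidia-smi    # the driver should list the GPU even when it is idle
```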

hello,
thank you again for working through this with me.

I removed the line as you suggested (module load anaconda3) and it now apparently runs without issue. It has been running for 24 hours on training and is still going. The output looks normal and it gets through the preprocessing steps alright; however, I am still getting the GPU issue below.

Starting training by running command (path)/topaz.sh train --train-images (path)/J114/image_list_train.txt --train-targets (path)/J114/topaz_particles_processed_train.txt --test-images (path)/CS-f-tractin/J114/image_list_test.txt --test-targets (path)/J114/topaz_particles_processed_test.txt --num-particles 22 --learning-rate 0.0002 --minibatch-size 128 --num-epochs 10 --method GE-binomial --slack -1 --autoencoder 0 --l2 0.0 --minibatch-balance 0.0625 --epoch-size 5000 --model resnet8 --units 32 --dropout 0.0 --bn on --unit-scaling 2 --ngf 32 --num-workers 8 --cross-validation-seed 1394611377 --radius 3 --num-particles 22 --device 0 --no-pretrained --save-prefix=(path)/J114/models/model -o (path)/J114/train_test_curve.txt

[CPU: 232.0 MB] # Loading model: resnet8

[CPU: 232.0 MB] # Model parameters: units=32, dropout=0.0, bn=on

[CPU: 232.0 MB] # Receptive field: 71

[CPU: 232.0 MB] CudaWarning: module 'torch._C' has no attribute '_cuda_setDevice'

[CPU: 232.0 MB] Falling back to CPU.

[CPU: 232.0 MB] # Using device=0 with cuda=False

[CPU: 232.0 MB] # Loaded 647 training micrographs with 14576 labeled particles

[CPU: 233.4 MB] # Loaded 161 test micrographs with 3616 labeled particles

[CPU: 233.4 MB] # source split p_observed num_positive_regions total_regions

[CPU: 233.4 MB] # 0 train 0.000444 422704 953108640

[CPU: 233.4 MB] # 0 test 0.000442 104864 237172320

[CPU: 233.4 MB] # Specified expected number of particle per micrograph = 22.0

[CPU: 233.4 MB] # With radius = 3

[CPU: 233.4 MB] # Setting pi = 0.0004330943847072879

[CPU: 233.4 MB] WARNING: pi=0.0004330943847072879 but the observed fraction of positives is 0.0004435003338129429 and method is set to GE-binomial.

[CPU: 233.4 MB] WARNING: setting method to PN with pi=0.0004435003338129429 instead.

[CPU: 233.4 MB] WARNING: if you meant to use GE-binomial, please set pi > 0.0004435003338129429.

[CPU: 233.4 MB] # minibatch_size=128, epoch_size=5000, num_epochs=10

[CPU: 233.4 MB] RuntimeWarning: overflow encountered in exp

As you suggested, I tried running it in standalone mode: I logged directly onto a GPU node interactively and executed the command above. It did run as expected, but again gave the same error about falling back to CPU.

I also tried adding the following line, but it didn’t change anything. So weird.

export CUDA_VISIBLE_DEVICES=0,1

I’m guessing this is some incompatibility between PyTorch and CUDA? I can ask the system admin to reinstall them. Or possibly the conda environment is not working correctly.
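In case it helps, here is the probe I plan to run inside the topaz environment (a sketch; it assumes torch is importable there, and checks the exact attribute named in the CudaWarning, which a CPU-only PyTorch wheel typically lacks):

```shell
# Probe the torch build directly: if this is a CPU-only wheel, the CUDA
# bindings in torch._C are absent, which matches the CudaWarning above.
python - <<'EOF'
import torch
print(torch.__version__)             # a "+cpu" suffix would mean a CPU-only build
print(torch.version.cuda)            # None for CPU-only builds
print(hasattr(torch._C, "_cuda_setDevice"))
EOF
```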

thanks again, your time and help are much appreciated!

Installing topaz in its own conda environment should go a long way toward ensuring compatibility between the various toolkits. You or your IT support may want to separately ensure that:

  1. no versions of the toolkits other than the ones in the topaz conda environment get transiently “injected” into the cluster environment where topaz runs
  2. the nvidia driver on the GPU node is compatible with the CUDA toolkit in the topaz conda environment
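If it turns out the environment contains a CPU-only PyTorch, recreating it with a CUDA-enabled build may resolve this. A hedged sketch (the conda channels follow the Topaz README; treat the cudatoolkit version as a placeholder to be matched against the driver on the GPU node):

```shell
# Sketch: rebuild the topaz env with a GPU-enabled PyTorch stack.
conda create -n topaz -c tbepler -c pytorch topaz cudatoolkit=11.2
conda activate topaz
python -c 'import torch; print(torch.cuda.is_available())'  # expect True on a GPU node
```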