TOPAZ not using GPUs and CPUs

Hello,

I am running Topaz Extract through CryoSPARC on our cluster setup. When I leave the defaults I get an error about torch._C (see the full job output posted below). So instead I try to run on CPU only, but that does not seem to work either: if I ask for e.g. 6 CPUs and then log into the node, I see only 1 CPU being used.

What is the difference here between threads and CPUs?

thanks

@orangeboomerang Please can you post the text of your error message (so users with a similar problem will find this topic in the future) and let us know whether:

  1. other GPU job types are running on this worker as expected
  2. you installed topaz in its own conda environment
  3. you are using a wrapper script for CryoSPARC-embedded topaz jobs
  4. topaz jobs give you the same message when you execute them via the command line, using the same Path to Topaz executable that you specified for the CryoSPARC-embedded Topaz Extract job (see the sketch below)
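
For item 4, a minimal sketch of such a test, assuming /path/to/topaz.sh stands in for the exact Path to Topaz executable you gave CryoSPARC and that the wrapper forwards its arguments to topaz:

```
# On a GPU node, call the same executable that CryoSPARC calls:
/path/to/topaz.sh extract --help    # should print the extract usage without import errors

# In the same conda environment the wrapper activates, confirm PyTorch can see a GPU:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```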

Hi,

Here is the full output from the GPU job. This should clarify things and answer your questions.

To answer your final question: yes, this command WILL run with standalone Topaz, which suggests that something is not right with the submission script I am using through CryoSPARC.

Thank you.


====================== Cluster submission script: ========================
==========================================================================
#!/bin/bash
#### cryoSPARC cluster submission script template for SLURM
## /nfs/science/group/cryosparc_v4.0.0_gpu99Master_gpu118Worker_PORT51000_LIC_2dfef1c2/cryosparc_worker/bin/cryosparcw run --project P9 --job J387 --master_hostname gpu99.institute.local --master_command_core_port 51002 > /nfs/science/group/cryosparc_v3.3.2_gpu114Master_gpu118Worker_LIC_2dfef1c2_CUDA_11.0.3/(path)/J387/job.log 2>&1             - the complete command string to run the job
## 8            - the number of CPUs needed
## 1            - the number of GPUs needed.
##                            Note: the code will use this many GPUs starting from dev id 0
##                                  the cluster scheduler or this script have the responsibility
##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up
##                                  using the correct cluster-allocated GPUs.
## 8.0             - the amount of RAM needed in GB
## /nfs/science/group/cryosparc_v4.0.0_gpu99Master_gpu118Worker_PORT51000_LIC_2dfef1c2/cryosparc_worker/bin/cryosparcw    - absolute path to the cryosparc worker command
## --project P9 --job J387 --master_hostname gpu99.institute.local --master_command_core_port 51002           - arguments to be passed to cryosparcw run
## P9        - uid of the project
## J387            - uid of the job
##
## What follows is a simple SLURM script:


#SBATCH --job-name cs_1_P9_J387
#SBATCH -n 8
#SBATCH --gres=gpu:1
#####SBATCH --mem=128000MB
#SBATCH --mem-per-cpu=11G
#SBATCH -o /nfs/science/group/cryosparc_slurm_outputs/output_P9_J387.txt
#SBATCH -e /nfs/science/group/cryosparc_slurm_outputs/error_P9_J387.txt
#Define the "gpu" partition for GPU-accelerated jobs
#SBATCH --partition=gpu

#Define the GPU architecture (GTX980 in the example, other options are GTX1080Ti, K40)
######SBATCH --constraint=GTX1080Ti
#SBATCH --exclude=gpu227,gpu228,gpu138,gpu150,gpu148
######SBATCH --constraint=buster
#SBATCH --time=96:00:00

module load cuda/11.2.2
module load tensorflow

nvidia-smi

available_devs=""
for devidx in $(seq 0 15);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
#export CUDA_VISIBLE_DEVICES=$available_devs
echo $available_devs
echo $CUDA_HOME
echo "$(hostname)"
echo $SLURM_TMPDIR

/usr/bin/nvidia-smi

module list

export CRYOSPARC_SSD_PATH="${SLURM_TMPDIR}"

/nfs/science/group/cryosparc_v4.0.0_gpu99Master_gpu118Worker_PORT51000_LIC_2dfef1c2/cryosparc_worker/bin/cryosparcw run --project P9 --job J387 --master_hostname gpu99.institute.local --master_command_core_port 51002 > /nfs/(path).job.log 2>&1 

==========================================================================
==========================================================================

-------- Submission command: 
sbatch /(path)/queue_sub_script.sh


-------- Queued on cluster at XXXXX

Job J387 Started
[CPU: 96.4 MB]

Master running v4.0.0, worker running v4.0.0
[CPU: 96.7 MB]

Working in directory: (path)
[CPU: 96.7 MB]

Running on lane slurmcluster
[CPU: 96.7 MB]

Resources allocated: 
[CPU: 96.7 MB]

  Worker:  slurmcluster
[CPU: 96.7 MB]

  CPU   :  [0, 1, 2, 3, 4, 5, 6, 7]
[CPU: 96.7 MB]

  GPU   :  [0]
[CPU: 96.7 MB]

  RAM   :  [0]
[CPU: 96.7 MB]

  SSD   :  False
[CPU: 96.7 MB]

--------------------------------------------------------------
[CPU: 96.7 MB]

Importing job module for job type topaz_extract...
[CPU: 243.0 MB]

Job ready to run
[CPU: 243.0 MB]

***************************************************************
[CPU: 243.0 MB]

Topaz is a particle detection tool created by Tristan Bepler and Alex J. Noble.
Citations:
- Bepler, T., Morin, A., Rapp, M. et al. Positive-unlabeled convolutional neural networks for particle picking in cryo-electron micrographs. Nat Methods 16, 1153-1160 (2019) doi:10.1038/s41592-019-0575-8
- Bepler, T., Noble, A.J., Berger, B. Topaz-Denoise: general deep denoising models for cryoEM. bioRxiv 838920 (2019) doi: https://doi.org/10.1101/838920

Structura Biotechnology Inc. and cryoSPARC do not license Topaz nor distribute Topaz binaries. Please ensure you have your own copy of Topaz licensed and installed under the terms of its GNU General Public License v3.0, available for review at: https://github.com/tbepler/topaz/blob/master/LICENSE.
***************************************************************

[CPU: 246.0 MB]

Starting Topaz process using version 0.2.4...
[CPU: 246.0 MB]

Skipping preprocessing.
[CPU: 246.0 MB]

Using preprocessed micrographs from  J183/preprocessed
[CPU: 246.2 MB]

--------------------------------------------------------------
[CPU: 246.2 MB]

Inverting negative staining...
[CPU: 246.2 MB]

Inverting negative staining complete.

[CPU: 246.2 MB]

--------------------------------------------------------------
[CPU: 246.2 MB]

Starting extraction...

[CPU: 246.2 MB]

Starting extraction by running command (path)/topaz.sh extract --radius 7 --threshold -6 --up-scale 4 --assignment-radius -1 --min-radius 5 --max-radius 100 --step-radius 5 --num-workers 8 --device 0 --model (path) -o (path) [MICROGRAPH PATHS EXCLUDED FOR LEGIBILITY]

[CPU: 246.2 MB]

Please type
[CPU: 246.2 MB]

source
[CPU: 246.2 MB]

/(path)/anaconda3/2022.05/activate_anaconda3_2022.05.txt
[CPU: 246.2 MB]

**CudaWarning: module 'torch._C' has no attribute '_cuda_setDevice'**
**[CPU: 246.2 MB]**

**Falling back to CPU.**
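
In case it helps with diagnosis: this warning appears to be raised where Topaz tries to select a CUDA device, and a CPU-only PyTorch build in the Topaz environment would produce exactly this behaviour. A minimal check from inside the environment that topaz.sh activates (the environment name here is only a placeholder):

```
# Activate the conda environment used by the topaz.sh wrapper (name assumed):
conda activate topaz
# A CPU-only PyTorch build prints "None" for the CUDA version and "False" for availability:
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
```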

To add a bit more, here is some of the output from the job info on the cluster:

 ...
...
   NodeList=gpu125
   BatchHost=gpu125
   NumNodes=1 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=88G,node=1,billing=8,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   JOB_GRES=gpu:1
     Nodes=gpu125 CPU_IDs=0-7 Mem=90112 GRES=gpu:1(IDX:0)
   MinCPUsNode=1 MinMemoryCPU=11G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
  ...
...
   TresPerNode=gpu:1
   NtasksPerTRES:0

When I SSH into gpu125 and check nvidia-smi, it reports "No running processes found".

Something may be wrong with (path)/topaz.sh. Would you like to post the script?
Also: Do CryoSPARC GPU-enabled jobs, other than jobs that wrap TOPAZ commands, run as expected on your cluster?
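
For reference, a wrapper for CryoSPARC-embedded Topaz is usually a small shell script that activates the conda environment containing topaz and then forwards all of its arguments; a minimal sketch (the anaconda path and environment name are placeholders, yours will differ):

```
#!/usr/bin/env bash
# Activate the conda environment that has topaz installed (path and env name assumed):
source /path/to/anaconda3/etc/profile.d/conda.sh
conda activate topaz
# Forward every argument CryoSPARC passes straight to the topaz executable:
exec topaz "$@"
```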