THCudaCheck FAIL

pranav · May 18, 2020, 1:09pm

Hi,

I have installed the latest version of Topaz from Tristan’s GitHub page using the instructions he provided. Then I followed your instructions to make the executable visible to cryoSPARC. When I run the program from within cryoSPARC, it executes successfully up to a certain stage (see the log below), and then the process does not appear to progress beyond the THCudaCheck error. Can you help me understand how to get around this problem?

[CPU: 166.7 MB]  Starting Topaz process using version 0.2.3...
[CPU: 166.8 MB]  Random seed used is 1328817122
[CPU: 166.8 MB]  --------------------------------------------------------------
[CPU: 166.8 MB]  Starting preprocessing...
[CPU: 166.8 MB]  Starting micrograph preprocessing by running command /path/to/topaz/bin/topaz preprocess --scale 16 --niters 200 --num-workers 8 -o  /path/to/preprocessed [MICROGRAPH PATHS EXCLUDED FOR LEGIBILITY]
[CPU: 166.8 MB]  Preprocessing over 4 processes...
[CPU: 166.9 MB]  Inverting negative staining...
[CPU: 166.9 MB]  Inverting negative staining complete.
[CPU: 166.9 MB]  Micrograph preprocessing command complete.
[CPU: 166.9 MB]  Starting particle pick preprocessing by running command  /path/to/topaz convert --down-scale 16 --threshold 0 -o  /path/to/topaz_particles_processed.txt  /path/to/topaz_particles_raw.txt
[CPU: 166.9 MB]  Particle pick preprocessing command complete.
[CPU: 166.9 MB]  Preprocessing done in 47.271s.
[CPU: 166.9 MB]  --------------------------------------------------------------
[CPU: 166.9 MB]  Starting train-test splitting...
[CPU: 166.9 MB]  Starting dataset splitting by running command /path/to/topaz train_test_split --number 19 --seed 1328817122 --image-dir /path/to/topaz_particles_processed.txt
[CPU: 166.9 MB]  # splitting 97 micrographs with 15403 labeled particles into 78 train and 19 test micrographs
[CPU: 166.9 MB]  # writing: /path/to/topaz_particles_processed_train.txt
[CPU: 166.9 MB]  # writing:  /path/to/opaz_particles_processed_test.txt
[CPU: 166.9 MB]  # writing: /path/to/image_list_train.txt
[CPU: 166.9 MB]  # writing: /path/to/image_list_test.txt
[CPU: 166.9 MB]  
Dataset splitting command complete.
[CPU: 166.9 MB]  Train-test splitting done in 18.990s.
[CPU: 166.9 MB]  --------------------------------------------------------------
[CPU: 166.9 MB]  Starting training...
[CPU: 166.9 MB]  Starting training by running command /path/to/topaz/bin/topaz train --train-images path/to/image_list_train.txt --train-targets /path/totopaz_particles_processed_train.txt --test-images path/to/image_list_test.txt --test-targets/path/to/topaz_particles_processed_test.txt --num-particles 100 --learning-rate 0.0002 --minibatch-size 128 --num-epochs 10 --method GE-binomial --slack -1 --autoencoder 0 --l2 0.0 --minibatch-balance 0.0625 --epoch-size 5000 --model resnet8 --units 32 --dropout 0.0 --bn on --unit-scaling 2 --ngf 32 --num-workers 8 --cross-validation-seed 1328817122 --device 0 --no-pretrained --save-prefix=/path/to/models/model -o /path/to/train_test_curve.txt
[CPU: 166.9 MB]  # Loading model: resnet8
[CPU: 166.9 MB]  # Model parameters: units=32, dropout=0.0, bn=on
[CPU: 166.9 MB]  # Receptive field: 71
[CPU: 166.9 MB]  # Using device=0 with cuda=True
[CPU: 166.9 MB]  # Loaded 78 training micrographs with 12548 labeled particles
[CPU: 166.9 MB]  # Loaded 19 test micrographs with 2855 labeled particles
[CPU: 166.9 MB]  # source	split	p_observed	num_positive_regions	total_regions
[CPU: 166.9 MB]  # 0	train	0.0845	363892	4306302
[CPU: 166.9 MB]  # 0	test	0.0789	82795	1048971
[CPU: 166.9 MB]  # Specified expected number of particle per micrograph = 100.0
[CPU: 166.9 MB]  # With radius = 3
[CPU: 166.9 MB]  # Setting pi = 0.0525276675904
[CPU: 166.9 MB]  WARNING: pi=0.0525276675904 but the observed fraction of positives is 0.084502201657 and method is set to GE-binomial.
[CPU: 166.9 MB]  WARNING: setting method to PN with pi=0.084502201657 instead.
[CPU: 166.9 MB]  WARNING: if you meant to use GE-binomial, please set pi > 0.084502201657.
[CPU: 166.9 MB]  # minibatch_size=128, epoch_size=5000, num_epochs=10
[CPU: 166.9 MB]  THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1544084119927/work/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument

Looking forward to hearing from you.

Best,
Pranav
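
For anyone scanning job logs for this message, here is a minimal sketch of pulling the fields out of the THCudaCheck line shown above. The function name `parse_thcudacheck` and the regular expression are illustrative assumptions, not part of Topaz, PyTorch, or cryoSPARC; the pattern is only modelled on the single line in this log.

```python
import re

# Pattern modelled on the one line seen in this thread, e.g.
# "THCudaCheck FAIL file=/path/THCGeneral.cpp line=405 error=11 : invalid argument"
THC_PATTERN = re.compile(
    r"THCudaCheck FAIL file=(?P<file>\S+) line=(?P<line>\d+) "
    r"error=(?P<code>\d+) : (?P<message>.*)$"
)


def parse_thcudacheck(log_line):
    """Return the fields of a THCudaCheck FAIL line as a dict, or None if absent."""
    match = THC_PATTERN.search(log_line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared or tabulated.
    fields["line"] = int(fields["line"])
    fields["code"] = int(fields["code"])
    fields["message"] = fields["message"].strip()
    return fields


if __name__ == "__main__":
    sample = (
        "[CPU: 166.9 MB]  THCudaCheck FAIL "
        "file=/opt/conda/conda-bld/pytorch_1544084119927/work/aten/src/THC/THCGeneral.cpp "
        "line=405 error=11 : invalid argument"
    )
    print(parse_thcudacheck(sample))
```

This only reports what the log already says (source file, line, numeric code, and message); it does not explain the cause.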

stephan · May 20, 2020, 3:50pm

Hi @pranav,

The paths reported in the output don’t look correct, though I’m not sure whether you intentionally edited the output to remove personal information.

Also, could you explain how you installed Topaz, and on what system (OS, GPUs, etc.)?

pranav · May 20, 2020, 5:38pm

Hi Stephen,
Thanks for your reply. I have intentionally obscured the path to the Topaz executable to maintain privacy.
The specific steps I followed are:

1. module load Anaconda3/2019*
2. conda env create -p path/to/topaz python=2.7
3. source activate /path/to/topaz
4. conda install topaz cudatoolkit=10.0 -c tbepler -c pytorch
5. which topaz
6. Copy the path to the topaz executable and paste it into the cryoSPARC window.

This is installed under my user account on the cluster, so the system configurations vary between nodes. However, all GPUs run with CUDA 10, and the systems are running Linux version 3.10.0-957.10.1.el7.x86_64.

Does this help?
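
Since the install pins `cudatoolkit=10.0`, one thing worth checking is what the PyTorch inside the Topaz environment itself reports. Below is a minimal sketch, assuming it is run with the Python interpreter of the activated Topaz conda environment on a GPU node. The helper name `describe_torch_cuda` is hypothetical; the calls it uses (`torch.version.cuda`, `torch.cuda.is_available`, `torch.cuda.device_count`, `torch.cuda.get_device_name`) are standard PyTorch attributes.

```python
def describe_torch_cuda():
    """Collect what the active environment's PyTorch reports about CUDA."""
    info = {
        "torch_installed": False,
        "torch_version": None,
        "built_with_cuda": None,   # CUDA version the PyTorch binary was built against
        "cuda_available": False,
        "devices": [],
    }
    try:
        import torch
    except (ImportError, OSError):
        # PyTorch is missing or failed to load in this interpreter.
        return info

    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    info["built_with_cuda"] = torch.version.cuda
    info["cuda_available"] = bool(torch.cuda.is_available())
    if info["cuda_available"]:
        for index in range(torch.cuda.device_count()):
            info["devices"].append(torch.cuda.get_device_name(index))
    return info


if __name__ == "__main__":
    report = describe_torch_cuda()
    for key in sorted(report):
        print("{0}: {1}".format(key, report[key]))
```

Comparing `built_with_cuda` against the CUDA version on the node may help narrow things down, but a match or mismatch here is not by itself confirmed as the cause of the THCudaCheck message.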

Markel · July 14, 2020, 11:26am

Hi, I have the exact same problem. Did you manage to solve it?

pranav · July 14, 2020, 3:25pm

Hi, you can ignore this error. It would be nice to know what causes it, but it does not affect the successful running of the job.

Markel · July 15, 2020, 10:35am

Thanks, saw that the job finished anyway.

I installed Topaz on another machine without specifying the cudatoolkit=10.2 option, and that machine does not display the error.