CryoSPARC v4.4.1 Topaz: "_rmtree_safe_fd" "Device or resource busy"

YueY2017 · May 19, 2024, 6:56am

Hi,

I am running into some issues with Topaz in CryoSPARC v4.4.1.
CryoSPARC doesn't throw an error, but the training job never finishes. The errors below are reported repeatedly in the log. The error seems to involve accessing and removing some links, but it doesn't specify which. Please help! Thanks.

Yue


[CPU:  271.6 MB  Avail: 461.26 GB]

Starting training by running command /opt/topaz_conda_base/envs/topaz/bin/topaz train --train-images /home/yue.yu/projects_symlink/cryosparc_data_process/CS-240510-pickathon-picks-2d-classify/J120/preprocessed --train-targets /home/yue.yu/projects_symlink/cryosparc_data_process/CS-240510-pickathon-picks-2d-classify/J120/topaz_particles_processed_train.txt --num-particles 200 --learning-rate 0.0002 --minibatch-size 128 --num-epochs 10 --method GE-binomial --slack -1 --autoencoder 0 --l2 0.0 --minibatch-balance 0.0625 --epoch-size 5000 --model resnet8 --units 32 --dropout 0.0 --bn on --unit-scaling 2 --ngf 32 --num-workers 8 --cross-validation-seed 1508269122 --radius 3 --num-particles 200 --device 0 --no-pretrained --save-prefix=/home/yue.yu/projects_symlink/cryosparc_data_process/CS-240510-pickathon-picks-2d-classify/J120/models/model -o /home/yue.yu/projects_symlink/cryosparc_data_process/CS-240510-pickathon-picks-2d-classify/J120/train_test_curve.txt

[CPU:  271.7 MB  Avail: 458.61 GB]

# Loading model: resnet8
[CPU:  271.7 MB  Avail: 458.61 GB]

# Model parameters: units=32, dropout=0.0, bn=on
[CPU:  271.7 MB  Avail: 458.61 GB]

# Receptive field: 71
[CPU:  271.7 MB  Avail: 458.61 GB]

# Using device=0 with cuda=True
[CPU:  271.7 MB  Avail: 458.61 GB]

# Loaded 1 training micrographs with 47 labeled particles
[CPU:  271.7 MB  Avail: 458.61 GB]

# source	split	p_observed	num_positive_regions	total_regions
[CPU:  271.7 MB  Avail: 458.61 GB]

# 0	train	0.00139	1363	983040
[CPU:  271.7 MB  Avail: 458.61 GB]

# Specified expected number of particle per micrograph = 200.0
[CPU:  271.7 MB  Avail: 458.61 GB]

# With radius = 3
[CPU:  271.7 MB  Avail: 458.60 GB]

# Setting pi = 0.005900065104166667
[CPU:  271.7 MB  Avail: 458.60 GB]

# minibatch_size=128, epoch_size=5000, num_epochs=10
[CPU:  271.7 MB  Avail: 457.06 GB]

Traceback (most recent call last):
[CPU:  271.7 MB  Avail: 457.06 GB]

File "/opt/topaz_conda_base/envs/topaz/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
[CPU:  271.7 MB  Avail: 457.09 GB]

finalizer()
[CPU:  271.7 MB  Avail: 457.09 GB]

File "/opt/topaz_conda_base/envs/topaz/lib/python3.6/multiprocessing/util.py", line 186, in __call__
[CPU:  271.7 MB  Avail: 457.09 GB]

res = self._callback(*self._args, **self._kwargs)
[CPU:  271.7 MB  Avail: 457.10 GB]

File "/opt/topaz_conda_base/envs/topaz/lib/python3.6/shutil.py", line 486, in rmtree
[CPU:  271.7 MB  Avail: 457.13 GB]

_rmtree_safe_fd(fd, path, onerror)
[CPU:  271.7 MB  Avail: 457.13 GB]

File "/opt/topaz_conda_base/envs/topaz/lib/python3.6/shutil.py", line 444, in _rmtree_safe_fd
[CPU:  271.7 MB  Avail: 457.13 GB]

onerror(os.unlink, fullname, sys.exc_info())
[CPU:  271.7 MB  Avail: 457.14 GB]

File "/opt/topaz_conda_base/envs/topaz/lib/python3.6/shutil.py", line 442, in _rmtree_safe_fd
[CPU:  271.7 MB  Avail: 457.16 GB]

os.unlink(name, dir_fd=topfd)
[CPU:  271.7 MB  Avail: 457.16 GB]

OSError: [Errno 16] Device or resource busy: '.nfs00000000000075650000052b'

wtempel · May 21, 2024, 8:32pm

Are all the network filesystems on the relevant CryoSPARC worker computer fully functional?
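
One way to narrow this down: a `.nfs*` name like the one in the last log line is what an NFS client leaves behind when a file is deleted while some process still has it open. The traceback goes through `multiprocessing`'s finalizers into `shutil.rmtree`, which is most likely the cleanup of a `pymp-*` scratch directory created under Python's temporary directory. If that reading is right, the scratch directory is sitting on a network filesystem. The sketch below is a minimal, Linux-only check that assumes `/proc/mounts` is readable; `mount_for` is a hypothetical helper written for this post, not part of CryoSPARC or Topaz.

```python
import os
import tempfile


def mount_for(path, mounts_file="/proc/mounts"):
    """Return (mount_point, fs_type) for the longest mount point containing path.

    Hypothetical helper for illustration. Mount points containing spaces are
    escaped in /proc/mounts and are not handled here.
    """
    path = os.path.realpath(path)
    best = ("", "unknown")
    with open(mounts_file) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 3:
                continue
            mount_point, fs_type = fields[1], fields[2]
            # Match the mount point itself or anything below it.
            prefix = mount_point.rstrip("/") + "/"
            inside = path == mount_point or path.startswith(prefix)
            if inside and len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    # tempfile.gettempdir() honors TMPDIR, TEMP and TMP, then falls back to
    # /tmp, /var/tmp, /usr/tmp and finally the current working directory.
    tmp = tempfile.gettempdir()
    print("temp dir:", tmp, "->", mount_for(tmp))
    print("cwd:     ", os.getcwd(), "->", mount_for(os.getcwd()))
```

Run it as the same user and with the same environment as the CryoSPARC worker, ideally from the job directory. If the reported filesystem type is `nfs` or `nfs4`, it may be worth testing with `TMPDIR` pointed at a local disk; whether that variable reaches the Topaz process depends on how the worker environment is set up, so treat that as an assumption to verify.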