CUDA_ERROR_INVALID_HANDLE during Deep Picker training

Marcell · July 1, 2025, 5:37pm

I tried running a Deep Picker training job and got the error below. Getting this job to work is not a high priority for me, but I thought I would raise the issue. It looks like a compatibility issue with TensorFlow. Is this something I need to update on my end, or is CryoSPARC 4.7.1-cuda12 using a version of TensorFlow that is not yet compatible with CUDA 12.8?

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 129, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 275, in cryosparc_master.cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_train
  File "cryosparc_master/cryosparc_compute/jobs/deep_picker/train.py", line 56, in cryosparc_master.cryosparc_compute.jobs.deep_picker.train.train_picker
  File "cryosparc_master/cryosparc_compute/jobs/deep_picker/train.py", line 118, in cryosparc_master.cryosparc_compute.jobs.deep_picker.train.train_picker
  File "/home/turul_csparc/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1510, in shuffle
    return shuffle_op._shuffle(  # pylint: disable=protected-access
  File "/home/turul_csparc/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/data/ops/shuffle_op.py", line 32, in _shuffle
    return _ShuffleDataset(
  File "/home/turul_csparc/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/data/ops/shuffle_op.py", line 51, in __init__
    self._seed, self._seed2 = random_seed.get_seed(seed)
  File "/home/turul_csparc/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/data/util/random_seed.py", line 50, in get_seed
    math_ops.equal(seed, 0), math_ops.equal(seed2, 0)),
  File "/home/turul_csparc/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/turul_csparc/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 6002, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InternalError: {{function_node __wrapped__Equal_device_/job:localhost/replica:0/task:0/device:GPU:0}} 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE' [Op:Equal] name:
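
One way to see which CUDA version the installed TensorFlow was built against is to ask TensorFlow itself. The snippet below is only a sketch: it assumes it is run with the Python interpreter of the cryosparc_worker_env environment shown in the traceback paths, and the function name `tensorflow_build_info` is made up for illustration and is not part of CryoSPARC. It reports build metadata only and does not by itself confirm or rule out a compatibility problem.

```python
def tensorflow_build_info():
    """Return TensorFlow's version and the CUDA/cuDNN versions it was built with.

    Returns None if TensorFlow cannot be imported in this interpreter.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None

    # Build-time metadata; CPU-only builds have no CUDA keys, hence .get().
    info = tf.sysconfig.get_build_info()
    return {
        "tensorflow_version": tf.__version__,
        "cuda_version": info.get("cuda_version"),
        "cudnn_version": info.get("cudnn_version"),
        # GPUs that TensorFlow can see at runtime (may be empty).
        "gpus": [d.name for d in tf.config.list_physical_devices("GPU")],
    }


if __name__ == "__main__":
    result = tensorflow_build_info()
    if result is None:
        print("TensorFlow is not importable from this interpreter.")
    else:
        for key, value in result.items():
            print(f"{key}: {value}")
```

Comparing the reported `cuda_version` with the "CUDA Version" shown by nvidia-smi may help narrow down whether the build and the driver are mismatched.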

wtempel · July 2, 2025, 2:47pm

Thanks @Marcell for reporting this.

I moved your post to this new forum topic.
Could you please post the outputs of these commands on the worker node where you observed the error:

uname -a
cat /etc/*release
nvidia-smi
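
If it is more convenient to gather all three outputs in one go, something like the following could be used. This is a minimal sketch that assumes Python 3 is available on the worker node; the helper name `collect_diagnostics` is hypothetical and not part of CryoSPARC. It simply runs the commands listed above through the shell and records each command's output and exit code.

```python
import subprocess

# The three commands requested above; the glob in the second one needs a shell.
COMMANDS = ["uname -a", "cat /etc/*release", "nvidia-smi"]


def collect_diagnostics(commands=COMMANDS, timeout=60):
    """Run each command through the shell and return {command: result}.

    Each result holds the exit code and the combined stdout/stderr text.
    A command that fails or is missing is recorded rather than raising.
    """
    results = {}
    for cmd in commands:
        try:
            proc = subprocess.run(
                cmd,
                shell=True,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            results[cmd] = {
                "returncode": proc.returncode,
                "output": proc.stdout + proc.stderr,
            }
        except subprocess.TimeoutExpired:
            results[cmd] = {"returncode": None, "output": "timed out"}
    return results


if __name__ == "__main__":
    for cmd, res in collect_diagnostics().items():
        print(f"$ {cmd}  (exit code: {res['returncode']})")
        print(res["output"].rstrip())
        print()
```

The printed text can then be pasted into a reply as-is.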