Deep Picker Train fails (out of GPU memory?): Dst tensor is not initialized

One of our researchers is trying to use deep particle picking, but the training jobs are crashing. It looks to me like insufficient GPU memory. This error has been mentioned before: https://discuss.cryosparc.com/t/error-in-deep-picker-train/13932.

Can anyone tell me what affects the GPU memory requirements of a Deep Picker Train job? It looks like the number of micrographs is a factor, but do any of the job parameters (threads, fractions, shape of micrographs, etc.) affect this?

Current cryoSPARC version: v4.7.0
Single workstation, Ubuntu 24.04, 512GB RAM, 4 x 24GB RTX A5000, CUDA 12.9

The data set contains 6,839 micrographs and 2,374,499 particles (I think the researcher is being a bit optimistic here, but training did work with ~1000 micrographs).

The final error is:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 129, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 275, in cryosparc_master.cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_train
  File "cryosparc_master/cryosparc_compute/jobs/deep_picker/train.py", line 56, in cryosparc_master.cryosparc_compute.jobs.deep_picker.train.train_picker
  File "cryosparc_master/cryosparc_compute/jobs/deep_picker/train.py", line 121, in cryosparc_master.cryosparc_compute.jobs.deep_picker.train.train_picker
  File "/local_slow/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 793, in from_tensor_slices
    return TensorSliceDataset(tensors, name=name)
  File "/local_slow/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4477, in __init__
    element = structure.normalize_element(element)
  File "/local_slow/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/data/util/structure.py", line 125, in normalize_element
    ops.convert_to_tensor(t, name="component_%d" % i, dtype=dtype))
  File "/local_slow/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/profiler/trace.py", line 183, in wrapped
    return func(*args, **kwargs)
  File "/local_slow/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 1695, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/local_slow/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 48, in _default_conversion_function
    return constant_op.constant(value, dtype, name=name)
  File "/local_slow/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/framework/constant_op.py", line 267, in constant
    return _constant_impl(value, dtype, shape, name, verify_shape=False,
  File "/local_slow/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/framework/constant_op.py", line 279, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "/local_slow/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/framework/constant_op.py", line 304, in _constant_eager_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/local_slow/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/framework/constant_op.py", line 102, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
set status to failed

What makes me think it is a GPU memory issue is this warning from the job log:

2025-05-15 14:51:07.749561: W tensorflow/core/common_runtime/bfc_allocator.cc:462] Allocator (GPU_0_bfc) ran out of memory trying to allocate 42.08GiB (rounded to 45178421248)requested by op _EagerConst
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
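
For what it's worth, a rough back-of-the-envelope estimate makes the scaling with micrograph count plausible. The 1024 x 1024 training shape, single channel and float32 values below are my own assumptions for illustration (the arrays cryoSPARC actually builds may be larger, given the reported 42.08 GiB request), but the point is that the stack handed to from_tensor_slices grows linearly with the number of micrographs:

# Illustrative only: estimate the size of a float32 micrograph stack that
# would be copied to the GPU as a single eager constant. The 1024 x 1024
# training shape and single channel are assumptions, not necessarily
# cryoSPARC's actual internal layout.
def training_stack_gib(n_micrographs, shape=1024, channels=1, bytes_per_value=4):
    return n_micrographs * shape * shape * channels * bytes_per_value / 2**30

print(f"~1000 micrographs: {training_stack_gib(1000):5.1f} GiB")  # ~3.9 GiB, fits a 24 GB A5000
print(f" 6839 micrographs: {training_stack_gib(6839):5.1f} GiB")  # ~26.7 GiB, does not fit

If that is what is happening, the TF_GPU_ALLOCATOR=cuda_malloc_async hint in the log probably would not help here, since that addresses fragmentation rather than a single request larger than a 24 GB card.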

Hi @mokca!

That definitely looks like a GPU memory issue. For deep pickers in general, we do not expect training to improve significantly with more than a hundred or so micrographs (if even that many), so we would not expect you to gain much from training beyond the ~1k micrographs of your initial run.

Many thanks - I did wonder whether there was any point in having that many micrographs.