One of our researchers is trying to use deep particle picking, but the training jobs are crashing, and it looks to me like insufficient GPU memory. This error has been mentioned before: https://discuss.cryosparc.com/t/error-in-deep-picker-train/13932.
Can anyone tell me what affects the GPU memory requirements of a Deep Picker Train job? It looks like the number of micrographs is a factor, but do any of the job parameters (threads, fractions, shape of micrographs, etc.) affect this?
Current cryoSPARC version: v4.7.0
Single workstation, Ubuntu 24.04, 512GB RAM, 4 x 24GB RTX A5000, CUDA 12.9
The data set contains 6,839 micrographs and 2,374,499 particles (I think the researcher is being a bit optimistic here, but training worked with ~1000 micrographs).
The final error is:
Traceback (most recent call last):
File "cryosparc_master/cryosparc_compute/run.py", line 129, in cryosparc_master.cryosparc_compute.run.main
File "cryosparc_master/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 275, in cryosparc_master.cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_train
File "cryosparc_master/cryosparc_compute/jobs/deep_picker/train.py", line 56, in cryosparc_master.cryosparc_compute.jobs.deep_picker.train.train_picker
File "cryosparc_master/cryosparc_compute/jobs/deep_picker/train.py", line 121, in cryosparc_master.cryosparc_compute.jobs.deep_picker.train.train_picker
File "/local_slow/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 793, in from_tensor_slices
return TensorSliceDataset(tensors, name=name)
File "/local_slow/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4477, in __init__
element = structure.normalize_element(element)
File "/local_slow/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/data/util/structure.py", line 125, in normalize_element
ops.convert_to_tensor(t, name="component_%d" % i, dtype=dtype))
File "/local_slow/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/profiler/trace.py", line 183, in wrapped
return func(*args, **kwargs)
File "/local_slow/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 1695, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/local_slow/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 48, in _default_conversion_function
return constant_op.constant(value, dtype, name=name)
File "/local_slow/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/framework/constant_op.py", line 267, in constant
return _constant_impl(value, dtype, shape, name, verify_shape=False,
File "/local_slow/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/framework/constant_op.py", line 279, in _constant_impl
return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
File "/local_slow/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/framework/constant_op.py", line 304, in _constant_eager_impl
t = convert_to_eager_tensor(value, ctx, dtype)
File "/local_slow/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/tensorflow/python/framework/constant_op.py", line 102, in convert_to_eager_tensor
return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
set status to failed
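The traceback ends inside tf.data.Dataset.from_tensor_slices, which, as far as I understand TensorFlow, turns the whole training array into a single eager constant and copies it to the GPU in one allocation, so the requirement would scale with the total amount of extracted training data rather than with any batch-like parameter. Here is a minimal standalone sketch of that behaviour (the array shape is made up for illustration; it is not the cryoSPARC code path):

# Standalone sketch, not cryoSPARC code: from_tensor_slices materialises the
# whole array as one eager constant, which TensorFlow places on the GPU by
# default, so the copy scales with n_mics * crop_height * crop_width.
import numpy as np
import tensorflow as tf

n_mics, h, w = 6839, 1024, 1024                        # hypothetical crop size
print(f"constant size: {n_mics * h * w * 4 / 2**30:.1f} GiB")  # ~26.7 GiB here

data = np.zeros((n_mics, h, w), dtype=np.float32)

# Uncommenting this should reproduce the single huge _EagerConst copy to GPU:0:
# ds = tf.data.Dataset.from_tensor_slices(data)

# Pinning the dataset construction to the CPU keeps the constant in host RAM
# (a generic TensorFlow workaround, not something I can change inside the job):
with tf.device("/CPU:0"):
    ds_cpu = tf.data.Dataset.from_tensor_slices(data)

If that reading is right, the main factors should be how many micrographs go in and how large the extracted crops are, which would be consistent with the job succeeding at ~1000 micrographs.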
What makes me think it is a GPU memory issue is this part of the job log:
2025-05-15 14:51:07.749561: W tensorflow/core/common_runtime/bfc_allocator.cc:462] Allocator (GPU_0_bfc) ran out of memory trying to allocate 42.08GiB (rounded to 45178421248) requested by op _EagerConst
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
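I may try the suggested allocator, although a single 42.08 GiB request is larger than a 24 GB A5000 regardless of fragmentation, so I don't expect it to help much. For reference, this is how I would test the variable in a standalone script (it has to be set before TensorFlow initialises its GPU context; for the actual job I assume it would need to be exported in the worker's environment):

# Quick standalone check of the allocator hint from the log. TF_GPU_ALLOCATOR
# must be set before TensorFlow creates its GPU context, hence before the import.
import os
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))  # devices the async allocator applies to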