Hi, I have the same issue with Deep Picker. Training finishes normally, but inference fails in the same way other users have reported. I have tried several datasets and two different lanes with different types of GPUs, and I get the same errors. Below is one of the job logs:
================= CRYOSPARCW ======= 2022-07-19 16:53:24.477819 =========
Project P218 Job J58
Master cryosparc.host.utmb.edu Port 39002
========= monitor process now starting main process
MAINPROCESS PID 242447
========= monitor process now waiting for main process
MAIN PID 242447
deep_picker.run_deep_picker cryosparc_compute.jobs.jobregister
2022-07-19 16:53:26.442804: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
========= sending heartbeat
2022-07-19 16:53:36.845068: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-07-19 16:53:36.848634: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-07-19 16:53:36.870129: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:18:00.0 name: NVIDIA GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2022-07-19 16:53:36.870569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:3b:00.0 name: NVIDIA GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2022-07-19 16:53:36.870954: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 2 with properties:
pciBusID: 0000:86:00.0 name: NVIDIA GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2022-07-19 16:53:36.871349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 3 with properties:
pciBusID: 0000:af:00.0 name: NVIDIA GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2022-07-19 16:53:36.871388: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-07-19 16:53:36.882954: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-07-19 16:53:36.883058: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-07-19 16:53:36.886501: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-07-19 16:53:36.888658: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-07-19 16:53:36.891538: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-07-19 16:53:36.894541: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-07-19 16:53:36.896552: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-07-19 16:53:36.899540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1, 2, 3
2022-07-19 16:53:36.900535: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-19 16:53:36.903344: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-07-19 16:53:36.903935: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:18:00.0 name: NVIDIA GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2022-07-19 16:53:36.903986: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-07-19 16:53:36.904015: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-07-19 16:53:36.904031: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-07-19 16:53:36.904046: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-07-19 16:53:36.904061: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-07-19 16:53:36.904076: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-07-19 16:53:36.904091: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-07-19 16:53:36.904106: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-07-19 16:53:36.904833: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-07-19 16:53:36.904870: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-07-19 16:53:37.683131: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-07-19 16:53:37.683194: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2022-07-19 16:53:37.683209: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2022-07-19 16:53:37.684870: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 128 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:18:00.0, compute capability: 7.5)
WARNING:tensorflow:Error in loading the saved optimizer state. As a result, your model is starting with a freshly initialized optimizer.
========= sending heartbeat
2022-07-19 16:53:46.646396: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2022-07-19 16:53:46.647119: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2200000000 Hz
2022-07-19 16:53:48.217213: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-07-19 16:53:50.627088: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 302.29MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2022-07-19 16:53:50.627287: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 302.29MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2022-07-19 16:53:50.627825: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-07-19 16:53:51.580539: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-07-19 16:53:51.585989: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 548.13MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2022-07-19 16:53:51.586063: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 548.13MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2022-07-19 16:53:51.590803: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 88.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2022-07-19 16:53:51.590875: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 88.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2022-07-19 16:53:51.596573: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 592.14MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2022-07-19 16:53:51.596639: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 592.14MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2022-07-19 16:53:51.615351: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 548.16MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2022-07-19 16:53:51.615425: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 548.16MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Running job on hostname %s gimli.utmb.edu
Allocated Resources : {'fixed': {'SSD': False}, 'hostname': 'gimli.utmb.edu', 'lane': 'smith', 'lane_type': 'smith', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0], 'GPU': [0], 'RAM': [0]}, 'target': {'cache_path': '/mnt/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11554848768, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11554848768, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 2, 'mem': 11554848768, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 3, 'mem': 11554848768, 'name': 'NVIDIA GeForce RTX 2080 Ti'}], 'hostname': 'gimli.utmb.edu', 'lane': 'smith', 'monitor_port': None, 'name': 'gimli.utmb.edu', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]}, 'ssh_str': 'cryosparc@gimli.utmb.edu', 'title': 'Worker node gimli.utmb.edu', 'type': 'node', 'worker_bin_path': '/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/bin/cryosparcw'}}
**** handle exception rc
set status to failed
I also tried splitting the input micrographs, but I still get the same error:
Launching job on lane default target cryosparc.host.utmb.edu …
Running job on master node hostname cryosparc.host.utmb.edu
[CPU: 69.2 MB] Project P218 Job J58 Started
[CPU: 69.3 MB] Master running v3.3.2+220518, worker running v3.3.2+220518
[CPU: 69.5 MB] Working in directory: /mnt/gimli/data2/P218/J58
[CPU: 69.5 MB] Running on lane default
[CPU: 69.5 MB] Resources allocated:
[CPU: 69.5 MB] Worker: cryosparc.host.utmb.edu
[CPU: 69.5 MB] CPU : [0]
[CPU: 69.5 MB] GPU : [0]
[CPU: 69.5 MB] RAM : [0]
[CPU: 69.5 MB] SSD : False
[CPU: 69.5 MB] --------------------------------------------------------------
[CPU: 69.5 MB] Importing job module for job type deep_picker_inference…
[CPU: 364.5 MB] Job ready to run
[CPU: 364.5 MB] ***************************************************************
[CPU: 368.5 MB] Using TensorFlow version 2.4.1
[CPU: 368.5 MB] Processing micrographs and inferring particles…
[CPU: 368.5 MB] Loading model…
[CPU: 368.5 MB] Loaded model.
[CPU: 368.5 MB] 0/500 micrographs processed.
[CPU: 1.97 GB] Original micrograph:
[CPU: 1.78 GB] Traceback (most recent call last):
File "cryosparc_worker/cryosparc_compute/run.py", line 85, in cryosparc_compute.run.main
File "cryosparc_worker/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 535, in cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_inference
File "cryosparc_worker/cryosparc_compute/jobs/deep_picker/deep_picker_utils.py", line 875, in cryosparc_compute.jobs.deep_picker.deep_picker_utils.picker_extract_worker
File "cryosparc_worker/cryosparc_compute/jobs/deep_picker/deep_picker_utils.py", line 880, in cryosparc_compute.jobs.deep_picker.deep_picker_utils.picker_extract_worker
File "cryosparc_worker/cryosparc_compute/jobs/deep_picker/deep_picker_utils.py", line 792, in cryosparc_compute.jobs.deep_picker.deep_picker_utils.picker_extract_worker._do_picking
File "/mnt/ape2/cryosparc/software/cryosparc/cryosparc_worker/cryosparc_compute/micrograph_plotutils.py", line 45, in showarray
a = a.reshape(-1, a.shape[-2], a.shape[-1])
ValueError: cannot reshape array of size 0 into shape (0,1)
This particular error only comes up with "show plots" turned on; with it off, I instead get "ValueError: need more than 1 value to unpack".
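In case it helps narrow this down: as far as I can tell, the reshape error is what NumPy raises when the array handed to showarray() is empty, which would suggest nothing survives the picking/extraction step by the time the plot is made. A minimal NumPy sketch (not cryoSPARC code; the (0, 1) shape is only my guess at what the empty array looks like at that point):

import numpy as np

# Illustration only (assumed shape): a zero-size array run through the same
# reshape pattern as micrograph_plotutils.showarray() fails the same way.
a = np.empty((0, 1))                          # empty array, size 0
a = a.reshape(-1, a.shape[-2], a.shape[-1])   # ValueError: cannot reshape array of size 0 into shape (0,1)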
Another user can run an inference job successfully on the same default lane, so I am not sure what the difference is.
Thanks, Michael