Hi,
I’m working with a user who is trying to run a homogeneous refinement job with a box size of 1250. A similar job with a box size of 1000 completes successfully. The larger job runs on the same platform, which is:
CryoSPARC version 4.3.0
Rocky Linux 9.2
NVIDIA A40s (48 GB memory) with CUDA 12.2 / driver 535.104.12
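For what it’s worth, here is the kind of check we can run to confirm the GPUs and their free memory are visible from Python. This is just a minimal sketch, assuming pycuda (which the CryoSPARC engine itself uses, per the traceback below) is importable in the worker’s Python environment:

import pycuda.driver as cuda

cuda.init()
for i in range(cuda.Device.count()):
    dev = cuda.Device(i)
    ctx = dev.make_context()
    try:
        # Free/total device memory as seen by the current context
        free_b, total_b = cuda.mem_get_info()
        print(f"GPU {i}: {dev.name()}, "
              f"{free_b / 1e9:.1f} GB free / {total_b / 1e9:.1f} GB total, "
              f"compute capability {dev.compute_capability()}")
    finally:
        ctx.pop()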
The larger job fails when the engine starts. From the job log, we see:
[CPU: 87.42 GB] Engine Started.
[CPU: 101.35 GB] Traceback (most recent call last):
  File "/panfs/home/shafen/shcryosparc/4.3/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2118, in run_with_except_hook
    run_old(*args, **kw)
  File "/panfs/home/shafen/shcryosparc/4.3/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 2450, in cryosparc_compute.engine.newengine.process.work
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 2588, in cryosparc_compute.engine.newengine.process.work
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 1832, in cryosparc_compute.engine.newengine.EngineThread.backproject
  File "cryosparc_master/cryosparc_compute/engine/cuda_core.py", line 316, in cryosparc_compute.engine.cuda_core.EngineBaseThread.toc
  File "cryosparc_master/cryosparc_compute/engine/cuda_core.py", line 309, in cryosparc_compute.engine.cuda_core.EngineBaseThread.wait
pycuda._driver.LogicError: cuStreamSynchronize failed: an illegal memory access was encountered
On the node, we see a fault reported for the GPU:
[Wed Oct 18 10:50:48 2023] NVRM: GPU at PCI:0000:27:00: GPU-341fafc7-cab4-95fa-21f8-08f4b16f0e4f
[Wed Oct 18 10:50:48 2023] NVRM: GPU Board Serial Number: 1325021031084
[Wed Oct 18 10:50:48 2023] NVRM: Xid (PCI:0000:27:00): 31, pid=105737, name=python, Ch 00000006, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x7f1f_62cc5000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_ATOMIC
There is no indication that we are running out of memory on the GPU or the CPU. We tried box sizes 1200, 1250, and 1152, which all fail; box sizes 700 and 1000 work. Does anyone have an idea of what could be wrong?
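For scale, here is a back-of-envelope estimate of how a single cubic volume grows with box size. These are our own rough numbers, not CryoSPARC’s actual allocation pattern (the engine presumably holds several such arrays at once):

import numpy as np

# Rough size of one N^3 array per box size; a refinement engine typically
# holds more than one such array at a time, so real usage is some multiple.
for box in (700, 1000, 1152, 1200, 1250):
    voxels = box ** 3
    f32_gb = voxels * np.dtype(np.float32).itemsize / 1e9
    c64_gb = voxels * np.dtype(np.complex64).itemsize / 1e9
    print(f"box {box}: {f32_gb:5.1f} GB (float32), {c64_gb:5.1f} GB (complex64)")

Even at box 1250 a single complex64 volume is only about 15.6 GB, well under the 48 GB on an A40, which is consistent with this not looking like a simple out-of-memory condition.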
We ran strace on the failed job; toward the end of the run, the strace logs show that cuda_core.py and newengine.py cannot be found. I can provide the strace logs if needed.
Thanks in advance,
Jeff