Problems running homogeneous refinement jobs with larger box sizes

jmcdonal · October 18, 2023, 4:09pm

Hi,
I’m working with a user who is trying to run a homogeneous refinement job with a box size of 1250. A similar job with a box size of 1000 completes successfully. Both jobs run on the same platform:
CryoSPARC version 4.3.0
Rocky Linux 9.2
NVIDIA A40s (48 GB memory) with CUDA version 12.2 / driver 535.104.12

The larger job fails when the engine starts. From the job log, we see:

[CPU: 87.42 GB]
Engine Started.

[CPU: 101.35 GB]
Traceback (most recent call last):
  File "/panfs/home/shafen/shcryosparc/4.3/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2118, in run_with_except_hook
    run_old(*args, **kw)
  File "/panfs/home/shafen/shcryosparc/4.3/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 2450, in cryosparc_compute.engine.newengine.process.work
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 2588, in cryosparc_compute.engine.newengine.process.work
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 1832, in cryosparc_compute.engine.newengine.EngineThread.backproject
  File "cryosparc_master/cryosparc_compute/engine/cuda_core.py", line 316, in cryosparc_compute.engine.cuda_core.EngineBaseThread.toc
  File "cryosparc_master/cryosparc_compute/engine/cuda_core.py", line 309, in cryosparc_compute.engine.cuda_core.EngineBaseThread.wait
pycuda._driver.LogicError: cuStreamSynchronize failed: an illegal memory access was encountered

On the node, we see a fault on the GPU:

[Wed Oct 18 10:50:48 2023] NVRM: GPU at PCI:0000:27:00: GPU-341fafc7-cab4-95fa-21f8-08f4b16f0e4f
[Wed Oct 18 10:50:48 2023] NVRM: GPU Board Serial Number: 1325021031084
[Wed Oct 18 10:50:48 2023] NVRM: Xid (PCI:0000:27:00): 31, pid=105737, name=python, Ch 00000006, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x7f1f_62cc5000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_ATOMIC
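For anyone checking several nodes for the same fault, here is a minimal sketch that pulls the Xid code, PCI address, pid and process name out of a kernel log line like the one above. The function name `parse_xid` and the regular expression are illustrative assumptions based only on the line format shown in this post. Other driver versions may format the line differently.

```python
import re

# Pattern modelled on the single NVRM line quoted in this thread.
# It is an assumption that other Xid lines follow the same layout.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>[^,]+), name=(?P<name>[^,]+), (?P<detail>.*)"
)

# Only the code seen in this thread is listed; NVIDIA's Xid documentation
# describes Xid 31 as a GPU memory page fault.
XID_NAMES = {31: "GPU memory page fault"}


def parse_xid(line):
    """Return a dict describing the Xid event in `line`, or None if absent."""
    match = XID_RE.search(line)
    if match is None:
        return None
    xid = int(match.group("xid"))
    return {
        "pci": match.group("pci"),
        "xid": xid,
        "pid": match.group("pid").strip(),
        "name": match.group("name").strip(),
        "detail": match.group("detail").strip(),
        "meaning": XID_NAMES.get(xid, "unknown (see NVIDIA Xid documentation)"),
    }


if __name__ == "__main__":
    sample = (
        "[Wed Oct 18 10:50:48 2023] NVRM: Xid (PCI:0000:27:00): 31, "
        "pid=105737, name=python, Ch 00000006, intr 00000000. MMU Fault: "
        "ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x7f1f_62cc5000. "
        "Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_ATOMIC"
    )
    print(parse_xid(sample))
```

Feeding it the output of `dmesg -T` line by line would list every Xid event on a node. A page fault of this kind is consistent with the "illegal memory access" reported by pycuda, but it does not by itself say whether the application, the driver or the hardware is at fault.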

There is no indication that we are running out of memory on the GPU or the CPU. Does anyone have an idea as to what could be wrong? (We tried box sizes 1200, 1250, 1152 which all fail.) Box sizes 700 and 1000 do work.

We ran a strace for the failed job and at the end of the job, the strace logs show that cuda_core.py and newengine.py cannot be found. I can provide the strace logs if needed.

Thanks in advance,
Jeff

rbs_sci · October 19, 2023, 6:52am

If I remember correctly, CryoSPARC had issues with box sizes larger than 1290(?) pixels due to the pyfftw implementation used. Not sure if that is still the case (and shouldn’t be the cause of this error…)
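As a quick arithmetic check on the box sizes mentioned in this thread, the sketch below computes the voxel count of a cubic volume and compares it with 32-bit limits. It assumes a dense N x N x N volume of 4-byte (float32) values; that layout, and the idea that any 32-bit index is involved at all, are assumptions for illustration and not confirmed CryoSPARC internals. The helper names are hypothetical.

```python
INT32_MAX = 2**31 - 1   # largest signed 32-bit value
UINT32_MAX = 2**32 - 1  # largest unsigned 32-bit value


def box_stats(n, bytes_per_voxel=4):
    """Voxel count and byte size of an n^3 volume (float32 assumed by default)."""
    voxels = n ** 3
    size_bytes = voxels * bytes_per_voxel
    return {
        "box": n,
        "voxels": voxels,
        "bytes": size_bytes,
        "voxels_fit_int32": voxels <= INT32_MAX,
        "bytes_fit_uint32": size_bytes <= UINT32_MAX,
    }


def largest_box_with_int32_voxel_index():
    """Largest n for which n^3 still fits in a signed 32-bit integer."""
    n = 1
    while (n + 1) ** 3 <= INT32_MAX:
        n += 1
    return n


if __name__ == "__main__":
    for n in (700, 1000, 1152, 1200, 1250, 1290, 1291):
        s = box_stats(n)
        print(
            f"box {s['box']:>4}: {s['voxels']:>13,} voxels, "
            f"{s['bytes'] / 2**30:6.2f} GiB, "
            f"voxels<=int32: {s['voxels_fit_int32']}, "
            f"bytes<=uint32: {s['bytes_fit_uint32']}"
        )
    print("largest box with int32 voxel index:", largest_box_with_int32_voxel_index())
```

Under these assumptions, 1290 is the largest box whose voxel count (2,146,689,000) fits in a signed 32-bit integer, which lines up with the remembered limit. The byte size of a single float32 volume crosses 2^32 at a box size of 1024, so 700 and 1000 fall below that mark and 1152, 1200 and 1250 fall above it. This is only arithmetic and does not establish the cause of the fault.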

Do you have ECC enabled on your A40s? If you do, please run nvidia-smi -q …do you see any uncorrectable errors listed? If so, the VRAM is faulty.
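If there are many GPUs to check, the ECC section of the report can be scanned with a short script. This is a minimal sketch that runs `nvidia-smi -q -d ECC` and flags any counter whose label contains "Uncorrectable" and whose value is non-zero. The helper names are hypothetical, and the exact labels in the report vary between driver versions, so the matching is deliberately loose.

```python
import shutil
import subprocess


def uncorrectable_counts(report):
    """Return (label, count) pairs for every numeric 'Uncorrectable' line."""
    counts = []
    for line in report.splitlines():
        label, sep, value = line.partition(":")
        value = value.strip()
        # Lines reporting "N/A" (ECC unsupported or disabled) are skipped.
        if sep and "Uncorrectable" in label and value.isdigit():
            counts.append((label.strip(), int(value)))
    return counts


def has_uncorrectable_errors(report):
    """True if any uncorrectable ECC counter in the report is non-zero."""
    return any(count > 0 for _, count in uncorrectable_counts(report))


def read_ecc_report():
    """Run nvidia-smi and return its ECC report, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "-q", "-d", "ECC"], capture_output=True, text=True
    )
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    report = read_ecc_report()
    if report is None:
        print("nvidia-smi not available on this host")
    elif has_uncorrectable_errors(report):
        print("Uncorrectable ECC errors found:")
        for label, count in uncorrectable_counts(report):
            if count > 0:
                print(f"  {label}: {count}")
    else:
        print("No uncorrectable ECC errors reported")
```

The script does not distinguish between GPUs or between volatile and aggregate counters; it only answers whether anything non-zero shows up, which is the question asked above.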

jmcdonal · October 19, 2023, 10:54am

Yes, we found documentation discussing an issue with box sizes above 1290, but we’re below that, so we didn’t think that was the issue.

We do have ECC enabled on the A40s, and we don’t see any uncorrectable ECC errors. The CryoSPARC system submits to a multi-node cluster, and the job has failed on multiple GPUs (A40s) as well as on an A100 GPU with 80 GB of memory.

Thanks,
Jeff