Increasing the computational minibatch size leads to errors or slowdowns in hetero refine

I played around a bit with the settings in hetero refine and found that increasing the computational minibatch size to >= 5000 leads to a CUDA_ERROR_ILLEGAL_ADDRESS error (on an A40, Driver Version: 530.30.02).

I verified it on two different servers to rule out hardware defects.

Traceback (most recent call last):
  File "/home/cryosparcuser/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2192, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 134, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 135, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 1130, in cryosparc_master.cryosparc_compute.engine.engine.process.work
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 551, in cryosparc_master.cryosparc_compute.engine.engine.EngineThread.cull_candidates
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 438, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.to_host
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 335, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.wait
  File "/home/cryosparcuser/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 2245, in synchronize
    driver.cuStreamSynchronize(self.handle)
  File "/home/cryosparcuser/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 352, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/home/cryosparcuser/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 412, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_ILLEGAL_ADDRESS] Call to cuStreamSynchronize results in CUDA_ERROR_ILLEGAL_ADDRESS



Jan  5 09:48:07 bert102 kernel: [23757298.153163] NVRM: Xid (PCI:0000:01:00): 31, pid=4067392, name=python, Ch 00000006, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x7eec_fe000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

Jan  5 09:53:32 bert107 kernel: [23762045.777297] NVRM: Xid (PCI:0000:61:00): 31, pid=1542913, name=python, Ch 00000006, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_9 faulted @ 0x7f25_76000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

I also found that increasing it actually decreases iteration speed.

  1. Batch size 4000, computational minibatch default (500)
[CPU:  14.00 GB] -- Iteration 1
[CPU:  14.00 GB] Batch size 8000 
[CPU:  14.00 GB] Using Alignment Radius 19.500 (13.391A)
[CPU:  14.00 GB] Using Reconstruction Radius 29.000 (9.004A)
[CPU:  14.01 GB] Randomizing assignments for identical classes...
[CPU:  14.01 GB] Number of BnB iterations 3
[CPU:  14.01 GB] DEV 0 THR 0 NUM 2000 TOTAL 17.719432 ELAPSED 4.2985999 --
[CPU:  15.33 GB] Processed 4000.000 images with 2 models in 5.764s.
[CPU:  15.33 GB] DEV 0 THR 1 NUM 2000 TOTAL 16.869069 ELAPSED 4.3001139 --
[CPU:  16.63 GB] Processed 4000.000 images with 2 models in 5.721s.
  2. Batch size 4000, computational minibatch 4000
[CPU:  13.42 GB] -- Iteration 1
[CPU:  13.42 GB] Batch size 8000 
[CPU:  13.42 GB] Using Alignment Radius 19.500 (13.391A)
[CPU:  13.42 GB] Using Reconstruction Radius 29.000 (9.004A)
[CPU:  13.43 GB] Randomizing assignments for identical classes...
[CPU:  13.43 GB] Number of BnB iterations 3
[CPU:  13.43 GB] DEV 0 THR 1 NUM 4000 TOTAL   0 ELAPSED 9.9518713 --
[CPU:  14.65 GB] Processed 4000.000 images with 2 models in 51.275s.
[CPU: 14.65 GB] DEV 0 THR 0 NUM 4000 TOTAL   0 ELAPSED 10.273054 --
[CPU:  14.88 GB] Processed 4000.000 images with 2 models in 60.427s.

I would guess this is because you are I/O limited, not actually compute limited. I'm not sure exactly, though; it would depend on a few things. At first glance, I would think:

  1. All the particles need to be read in, meaning that if we hold all other variables constant, whether they're read one at a time or a million at a time, batch size shouldn't matter.

  2. If you are not GPU-compute limited, reading in more particles to maximize compute could ultimately be faster.

Not everything is equal, though, so it's possible to dream up scenarios where short reads (especially from an SSD) are fast but taper off as the load becomes more sustained. Or maybe they aren't reading directly off the SSD, so things are hopping through system memory and you're hitting a bottleneck in the memory controller? My guess is that these types of workload are memory-bandwidth constrained nearly 100% of the time on high-performance systems. Maybe the memory bandwidth on the GPU itself can't handle that load?
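
One crude way to check the sustained-read idea would be to time large sequential reads from the scratch drive and watch whether throughput tapers off. A minimal sketch in Python, assuming a hypothetical file path; point it at something on the cache SSD that is larger than RAM (or drop the page cache first) so you measure the disk rather than memory:

import time

# Hypothetical path -- point this at a large file on the cryoSPARC scratch SSD.
TEST_FILE = "/scratch/cryosparc_cache/large_particle_stack.mrcs"
CHUNK = 64 * 1024 * 1024  # 64 MiB per read

with open(TEST_FILE, "rb", buffering=0) as f:
    while True:
        t0 = time.perf_counter()
        data = f.read(CHUNK)
        if not data:
            break
        dt = time.perf_counter() - t0
        # Per-chunk throughput; a steady decline over the run would point at
        # sustained-read limits rather than compute.
        print(f"{len(data) / dt / 1e6:8.1f} MB/s")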

CUDA_ERROR_ILLEGAL_ADDRESS

I believe that means you are out of GPU memory. I have seen this when the box size gets too large.


As @ccgauvin94 says, running out of VRAM with larger batch sizes is pretty common with larger boxes. You can increase it above the default for smaller boxes (<256 or so), but increases easily exceed available VRAM. With CryoSPARC 4.4, VRAM demands increased further due to the optimisations made, which makes it even easier to run out of memory (and it's why Low Memory Mode is now basically a requirement in NU Refine for boxes >600 pixels…)
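
For a rough sense of why larger boxes eat VRAM so quickly, here is a back-of-the-envelope sketch; the terms and the factor of two are assumptions for illustration, not CryoSPARC's actual allocation scheme:

def rough_vram_gb(box, n_classes, minibatch, overhead=2.0):
    # Assumed float32 footprint: particle images for one computational
    # minibatch plus per-class volumes/projection buffers, with a crude
    # fudge factor for FFTs and intermediates. Real usage will differ.
    bytes_per = 4
    particles = minibatch * box * box * bytes_per * overhead
    volumes = n_classes * box ** 3 * bytes_per * overhead
    return (particles + volumes) / 1e9

for box in (128, 256, 512, 600):
    print(box, round(rough_vram_gb(box, n_classes=2, minibatch=4000), 1), "GB")

The cubic term from the per-class volumes is what makes large boxes so punishing even before the minibatch grows.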

Does cryosparc actually support some sort of DirectStorage? If not, is it planned/useful?

SSD slowdown could be a reason, though we use quite good PCIe 4 drives in RAID 0 for scratch storage on the nodes. I will run the same test on nodes with an A100 and an H100 respectively to see whether the increased memory bandwidth makes a difference. I will also test again with different box sizes.

Thanks for the feedback!

DirectStorage is a Windows-only thing, isn't it? Or do you mean GPUDirect? NVIDIA's white paper on it was from 2019 or so, and I've not heard much about it since, although I know AMD has a workstation card (maybe more than one; I've basically given up on AMD for compute for the foreseeable future, so I'm not really keeping track) which can have M.2 SSDs connected directly.


Yes, I mean the NVIDIA one. I'm not really into programming that kind of thing, but it at least still seems to get new releases.

I'm not sure cryosparc could benefit from it at all, though, or whether it would make sense to put time into it, since I guess it's for data center cards only and a lot of people use consumer cards, I suppose. It would be interesting, though, since I know BeeGFS, which we use, also supports it in theory. It would then be interesting to compare it with BeeGFS over the InfiniBand network instead of using scratch SSDs at all.

Hi @KiSchnelle,

The CUDA_ERROR_ILLEGAL_ADDRESS error is likely due to an internal limitation within CryoSPARC which states that CUDA arrays must have at most 2^31 float32s. During refinement, we allocate a large single array that tracks projection errors across pose, shift, class, and particle # within the computational mini batch. In practice, this means that the maximum computational mini batch will be on the order of a few thousand particles (and will depend on other parameters like class #). This limit can be exceeded prior to running out of GPU memory, as it likely is here.
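
To make that arithmetic concrete, here is a quick back-of-the-envelope check against the limit; the pose and shift candidate counts below are placeholders chosen for illustration, since the real values depend on the alignment radius and the BnB iteration:

MAX_ELEMENTS = 2 ** 31  # internal limit: at most 2^31 float32s per CUDA array

# Placeholder candidate counts -- the real numbers are internal to the engine.
n_poses, n_shifts, n_classes = 10_000, 25, 2

for minibatch in (500, 2000, 4000, 5000):
    # One error value per (pose, shift, class, particle) combination.
    n = n_poses * n_shifts * n_classes * minibatch
    print(minibatch, f"{n:.2e}", "over limit" if n > MAX_ELEMENTS else "ok")

With these made-up counts the array crosses 2^31 somewhere between 4000 and 5000 particles per minibatch; that only illustrates the scaling, not the exact threshold.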

Note that, unlike batch size, the computational minibatch parameter should not affect the output of hetero refine in any way (apart from causing the job to crash :P), but it can speed things up or slow them down depending on other bottlenecks, as others have already mentioned.

Hope that helps!
Valentin


Gotcha. DMA directly to the GPU might yield improvements, but with so much else going on during a refinement, I wonder whether it would actually make much difference…