I was experimenting with the settings in Heterogeneous Refinement and found that setting the computational minibatch size to 5000 or higher leads to a CUDA_ERROR_ILLEGAL_ADDRESS error (on an A40, driver version 530.30.02).
I reproduced it on two different servers to rule out a hardware defect. The job fails with the following traceback:
Traceback (most recent call last):
  File "/home/cryosparcuser/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2192, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 134, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 135, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 1130, in cryosparc_master.cryosparc_compute.engine.engine.process.work
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 551, in cryosparc_master.cryosparc_compute.engine.engine.EngineThread.cull_candidates
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 438, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.to_host
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 335, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.wait
  File "/home/cryosparcuser/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 2245, in synchronize
    driver.cuStreamSynchronize(self.handle)
  File "/home/cryosparcuser/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 352, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/home/cryosparcuser/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 412, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_ILLEGAL_ADDRESS] Call to cuStreamSynchronize results in CUDA_ERROR_ILLEGAL_ADDRESS
The kernel log on each server shows an Xid 31 MMU fault from the python process:
Jan 5 09:48:07 bert102 kernel: [23757298.153163] NVRM: Xid (PCI:0000:01:00): 31, pid=4067392, name=python, Ch 00000006, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x7eec_fe000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Jan 5 09:53:32 bert107 kernel: [23762045.777297] NVRM: Xid (PCI:0000:61:00): 31, pid=1542913, name=python, Ch 00000006, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_9 faulted @ 0x7f25_76000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
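In case it helps anyone trying to reproduce this, here is a minimal sketch for pulling the Xid events out of a kernel log. The helper names (`parse_xid`, `xid_events`) are made up for this post and are not part of CryoSPARC or the NVIDIA driver, and the regular expressions assume only the line layout quoted above; other driver versions may format these messages differently.

```python
import re

# Hypothetical helpers, not part of CryoSPARC or the NVIDIA driver.
# The patterns only assume the NVRM line layout quoted in this post.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>\d+), name=(?P<name>[^,]+),"
)
ADDR_RE = re.compile(r"faulted @ (0x[0-9a-fA-F_]+)")


def parse_xid(line):
    """Return a dict describing an Xid event, or None if the line has none."""
    m = XID_RE.search(line)
    if m is None:
        return None
    event = {
        "pci": m.group("pci"),
        "xid": int(m.group("xid")),
        "pid": int(m.group("pid")),
        "name": m.group("name"),
        "fault_address": None,
    }
    # The fault address is only present on MMU fault messages.
    addr = ADDR_RE.search(line)
    if addr is not None:
        event["fault_address"] = addr.group(1)
    return event


def xid_events(lines):
    """Yield parsed Xid events from an iterable of kernel log lines."""
    for line in lines:
        event = parse_xid(line)
        if event is not None:
            yield event
```

Feeding it the saved output of `dmesg` or `journalctl -k`, one line at a time, should list the PCI address, Xid code, pid and fault address of each event, which makes it easier to match a fault to the failing job.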
I also found that increasing the computational minibatch size actually slows down each iteration:
- Batch size 4000, computational minibatch size 500 (default)
[CPU: 14.00 GB] -- Iteration 1
[CPU: 14.00 GB] Batch size 8000
[CPU: 14.00 GB] Using Alignment Radius 19.500 (13.391A)
[CPU: 14.00 GB] Using Reconstruction Radius 29.000 (9.004A)
[CPU: 14.01 GB] Randomizing assignments for identical classes...
[CPU: 14.01 GB] Number of BnB iterations 3
[CPU: 14.01 GB] DEV 0 THR 0 NUM 2000 TOTAL 17.719432 ELAPSED 4.2985999 --
[CPU: 15.33 GB] Processed 4000.000 images with 2 models in 5.764s.
[CPU: 15.33 GB] DEV 0 THR 1 NUM 2000 TOTAL 16.869069 ELAPSED 4.3001139 --
[CPU: 16.63 GB] Processed 4000.000 images with 2 models in 5.721s.
- Batch size 4000, computational minibatch size 4000
[CPU: 13.42 GB] -- Iteration 1
[CPU: 13.42 GB] Batch size 8000
[CPU: 13.42 GB] Using Alignment Radius 19.500 (13.391A)
[CPU: 13.42 GB] Using Reconstruction Radius 29.000 (9.004A)
[CPU: 13.43 GB] Randomizing assignments for identical classes...
[CPU: 13.43 GB] Number of BnB iterations 3
[CPU: 13.43 GB] DEV 0 THR 1 NUM 4000 TOTAL 0 ELAPSED 9.9518713 --
[CPU: 14.65 GB] Processed 4000.000 images with 2 models in 51.275s.
[CPU: 14.65 GB] DEV 0 THR 0 NUM 4000 TOTAL 0 ELAPSED 10.273054 --
[CPU: 14.88 GB] Processed 4000.000 images with 2 models in 60.427s.
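To put a number on the slowdown, the "Processed ... images ... in ...s" lines from the two runs can be turned into images per second. This is only a rough sketch: the function names are hypothetical, the regular expression assumes the log format shown above, and the four input lines are copied from the logs in this post.

```python
import re

# Matches e.g. "Processed 4000.000 images with 2 models in 5.764s."
PROCESSED_RE = re.compile(
    r"Processed ([\d.]+) images with (\d+) models in ([\d.]+)s"
)


def images_per_second(line):
    """Return throughput for one 'Processed' log line, or None if no match."""
    m = PROCESSED_RE.search(line)
    if m is None:
        return None
    n_images = float(m.group(1))
    seconds = float(m.group(3))
    return n_images / seconds


def mean_rate(lines):
    """Average throughput over all matching lines."""
    rates = [r for r in map(images_per_second, lines) if r is not None]
    return sum(rates) / len(rates)


# Lines copied from the two runs above.
minibatch_500 = [
    "[CPU: 15.33 GB] Processed 4000.000 images with 2 models in 5.764s.",
    "[CPU: 16.63 GB] Processed 4000.000 images with 2 models in 5.721s.",
]
minibatch_4000 = [
    "[CPU: 14.65 GB] Processed 4000.000 images with 2 models in 51.275s.",
    "[CPU: 14.88 GB] Processed 4000.000 images with 2 models in 60.427s.",
]

rate_default = mean_rate(minibatch_500)
rate_large = mean_rate(minibatch_4000)
slowdown = rate_default / rate_large

if __name__ == "__main__":
    print(f"minibatch 500:  {rate_default:.1f} images/s")
    print(f"minibatch 4000: {rate_large:.1f} images/s")
    print(f"slowdown:       {slowdown:.1f}x")
```

For these four lines that works out to roughly 700 images/s with the default minibatch versus roughly 70 images/s with a minibatch of 4000, so close to a tenfold slowdown.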