Experiencing an error when running 3D variability

Hello All,

Our group recently updated to v4.5.1, and I just ran into an issue when running a 3D Variability job. I've copied the traceback below. I have previously run 3DVA on older versions of cryoSPARC, so I'm not quite sure what the issue may be. If anyone has any insight into how to resolve this, I would greatly appreciate it.

Best,
Wil

Also, it throws the error right after it finishes the reconstruction and begins the iterations.
[CPU: 24.17 GB]
Using random seed 1385661333

[CPU: 24.17 GB]
Start iteration 0 of 20

[CPU: 24.17 GB]
batch 1 of 97

Traceback (most recent call last):
  File "/lsi/local/pkg/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1594, in load_module_image_cuda_python
    handle = driver.cuModuleLoadDataEx(image, len(options), option_keys,
  File "/lsi/local/pkg/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/lsi/local/pkg/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_ILLEGAL_ADDRESS] Call to cuModuleLoadDataEx results in CUDA_ERROR_ILLEGAL_ADDRESS

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 115, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/var3D/run.py", line 543, in cryosparc_master.cryosparc_compute.jobs.var3D.run.run
  File "cryosparc_master/cryosparc_compute/jobs/var3D/run.py", line 428, in cryosparc_master.cryosparc_compute.jobs.var3D.run.run.M_step
  File "cryosparc_master/cryosparc_compute/engine/newcuda_kernels.py", line 7120, in cryosparc_master.cryosparc_compute.engine.newcuda_kernels.backproject_crossterms
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 475, in cryosparc_master.cryosparc_compute.gpu.gpucore.context_dependent_memoize.wrapper
  File "cryosparc_master/cryosparc_compute/engine/newcuda_kernels.py", line 7102, in cryosparc_master.cryosparc_compute.engine.newcuda_kernels.get_backproject_components_kernels
  File "/lsi/local/pkg/cryosparc/cryosparc_worker/cryosparc_compute/gpu/compiler.py", line 214, in get_function
    cufunc = self.get_module().get_function(name)
  File "/lsi/local/pkg/cryosparc/cryosparc_worker/cryosparc_compute/gpu/compiler.py", line 176, in get_module
    mod = ctx.create_module_image(cubin)
  File "/lsi/local/pkg/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1437, in create_module_image
    module = load_module_image(self, image)
  File "/lsi/local/pkg/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1536, in load_module_image
    return load_module_image_cuda_python(context, image)
  File "/lsi/local/pkg/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1599, in load_module_image_cuda_python
    raise CudaAPIError(e.code, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_ILLEGAL_ADDRESS] cuModuleLoadDataEx error:

Welcome to the forum @salmen.

Please can you post the output of the command

cryosparcm cli "get_job('P12', 'J934', 'job_type', 'version', 'params_spec', 'instance_information')"

substituting the failed job’s actual project and job IDs for P12 and J934, respectively.

{'_id': '664df4b1efeae5ca38eb026b',
 'instance_information': {'CUDA_version': '11.8',
   'available_memory': '370.38GB',
   'cpu_model': 'Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz',
   'driver_version': '12.4',
   'gpu_info': [{'id': 0, 'mem': 47608692736, 'name': 'NVIDIA A40', 'pcie': '0000:1f:00'}],
   'ofd_hard_limit': 131072,
   'ofd_soft_limit': 1024,
   'physical_cores': 48,
   'platform_architecture': 'x86_64',
   'platform_node': 'gpucomp-08.hpc.lsi.umich.edu',
   'platform_release': '4.18.0-513.24.1.el8_9.x86_64',
   'platform_version': '#1 SMP Thu Mar 14 14:20:09 EDT 2024',
   'total_memory': '376.01GB',
   'used_memory': '2.85GB'},
 'job_type': 'var_3D',
 'params_spec': {'var_filter_res': {'value': 6}},
 'project_uid': 'P263',
 'uid': 'J617',
 'version': 'v4.5.1'}

This suggests it ran out of VRAM, as I see this error in some other programs when they exceed VRAM limits. Why it would change between 4.5 and earlier versions I’m not sure; I don’t think 4.5 made any explicit changes to 3DVA?

Did something else have some GPU memory assigned?

Thank you for the insight. I am currently using a communal GPU resource with a queue system, so I'm not entirely sure whether there were other jobs allocating GPU memory at the time.

Notably, I did try using binned data (from a box size of 800 down to 200), and the job ran to completion without throwing any errors. The only other reason I can think of for why the old job worked is that I had a slightly different particle set (96,696 particles vs. 86,340 particles).
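For anyone else hitting this, a quick back-of-envelope calculation shows why binning makes such a large difference: volume memory scales with the cube of the box size, so going from 800 to 200 cuts each volume's footprint by a factor of 64. The sketch below assumes 8 bytes per voxel (single-precision complex); it is only illustrative and does not reflect cryoSPARC's actual 3DVA allocations, which also include components, gradients, and scratch buffers.

```python
# Rough VRAM cost of one N^3 volume stored as single-precision
# complex values (4 bytes real + 4 bytes imaginary per voxel).
# Illustrative only -- not cryoSPARC's real allocation scheme.

def volume_gib(box_size: int, bytes_per_voxel: int = 8) -> float:
    """Approximate size in GiB of one box_size^3 volume."""
    return box_size ** 3 * bytes_per_voxel / 2 ** 30

for n in (200, 800):
    print(f"box {n}: {volume_gib(n):.2f} GiB per volume")

# Cubic scaling: (800 / 200)^3 = 64x more memory per volume at box 800.
print(f"scaling factor: {volume_gib(800) / volume_gib(200):.0f}x")
```

With several such volumes resident at once (one per variability component, plus working buffers), box 800 can plausibly approach the 48 GB of an A40, while box 200 stays far below it.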

Probably box size and different particle set size then. Schedulers don’t usually mess up and dual-allocate resources. :wink:

Good you got it to work! :smile:
