Assistance Needed with 2D Classification Memory Issue in CryoSparc v4.5.3

Philipp · July 11, 2024, 8:45am

Hi Team,

Firstly, I want to thank everyone for their contributions to the development of this wonderful software.

We are currently using CryoSparc v4.5.3 and have encountered an issue during 2D Classification runs. The problem seems to be related to memory management and started only recently. Previous classifications in the same project, with comparable particle counts and box sizes, completed without any issues. However, we now run into the following error:

Traceback (most recent call last):
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2294, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 134, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 135, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/jobs/class2D/newrun.py", line 639, in cryosparc_master.cryosparc_compute.jobs.class2D.newrun.class2D_engine_run.work
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 1383, in cryosparc_master.cryosparc_compute.engine.newengine.EngineThread.compute_resid_pow
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 374, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.ensure_allocated
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/devices.py", line 232, in _require_cuda_context
    return fn(*args, **kws)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/api.py", line 189, in pinned_array
    buffer = current_context().memhostalloc(bytesize)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1378, in memhostalloc
    return self.memory_manager.memhostalloc(bytesize, mapped, portable, wc)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 889, in memhostalloc
    pointer = allocator()
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 884, in allocator
    return driver.cuMemHostAlloc(size, flags)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_INVALID_VALUE] Call to cuMemHostAlloc results in CUDA_ERROR_INVALID_VALUE

We aren’t sure if this issue stems from the CUDA driver or if it is simply a case of our GPUs running out of memory. Is there any way of having a look at the estimated memory for certain jobs? We have tested this on two setups: one with 4x2080Ti (4 x 11GB) and another with 4x3090 (4 x 24GB).

Thank you very much for your help!

Cheers,
Philipp

wtempel · July 11, 2024, 1:13pm

Welcome to the forum @Philipp .
cuMemHostAlloc errors could indicate that the host system, not the GPU, is running out of RAM. Could you please provide additional details:

outputs of the command free -h on your systems
the box size of your particles
the number of particles

outputs of the commands

cryosparcm eventlog P99 J199 | tail -n 40
cryosparcm cli "get_job('P99', 'J199', 'job_type', 'params_spec')"

where you replace P99, J199 with the failed job’s project and job IDs, respectively

description and number of other compute loads (CryoSPARC or non-CryoSPARC) that may have run concurrently with the failed job
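The commands requested above could be generated for a given job with a small helper like the one below. This is only a sketch: the function name diag_commands is made up for illustration, and P99 / J199 are the same placeholders used above, not real IDs.

```shell
#!/usr/bin/env bash
# Sketch: print the diagnostic commands for one failed job.
# diag_commands is a hypothetical helper; the commands it prints are the
# ones listed in this post. P99/J199 are placeholder IDs.

diag_commands() {
    local project="$1" job="$2"
    # One command per line, ready to be reviewed and pasted into a terminal
    printf '%s\n' \
        "free -h" \
        "cryosparcm eventlog ${project} ${job} | tail -n 40" \
        "cryosparcm cli \"get_job('${project}', '${job}', 'job_type', 'params_spec')\""
}

# Falls back to the placeholder IDs when no arguments are given
diag_commands "${1:-P99}" "${2:-J199}"
```

Printing the commands rather than running them keeps the helper harmless on machines where cryosparcm is not on the PATH.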

Philipp · July 11, 2024, 5:45pm

Hi wtempel,

thank you for the fast reply and the help!

Regarding your post:

Number of particles: 3 million
Box Size: 400px with Fourier crop to 200px

And the output of the commands:

free -h

               total        used        free      shared  buff/cache   available
Mem:           376Gi        16Gi        17Gi       1.5Gi       342Gi       356Gi
Swap:          8.0Gi       272Mi       7.7Gi

cryosparcm cli "get_job('P68', 'J155', 'job_type', 'params_spec')"

{'_id': '668f7fcd193e5b86a429d4e8', 'job_type': 'class_2D_new', 'params_spec': {'class2D_K': {'value': 150}, 'compute_num_gpus': {'value': 4}}, 'project_uid': 'P68', 'uid': 'J155'}

cryosparcm eventlog P68 J155 | tail -n 40

  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 1383, in cryosparc_master.cryosparc_compute.engine.newengine.EngineThread.compute_resid_pow
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 374, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.ensure_allocated
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/devices.py", line 232, in _require_cuda_context
    return fn(*args, **kws)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/api.py", line 189, in pinned_array
    buffer = current_context().memhostalloc(bytesize)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1378, in memhostalloc
    return self.memory_manager.memhostalloc(bytesize, mapped, portable, wc)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 889, in memhostalloc
    pointer = allocator()
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 884, in allocator
    return driver.cuMemHostAlloc(size, flags)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_INVALID_VALUE] Call to cuMemHostAlloc results in CUDA_ERROR_INVALID_VALUE
[Thu, 11 Jul 2024 06:48:15 GMT] [CPU RAM used: 10436 MB] Traceback (most recent call last):
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2294, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 134, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 135, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/jobs/class2D/newrun.py", line 639, in cryosparc_master.cryosparc_compute.jobs.class2D.newrun.class2D_engine_run.work
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 1383, in cryosparc_master.cryosparc_compute.engine.newengine.EngineThread.compute_resid_pow
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 374, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.ensure_allocated
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/devices.py", line 232, in _require_cuda_context
    return fn(*args, **kws)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/api.py", line 189, in pinned_array
    buffer = current_context().memhostalloc(bytesize)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1378, in memhostalloc
    return self.memory_manager.memhostalloc(bytesize, mapped, portable, wc)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 889, in memhostalloc
    pointer = allocator()
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 884, in allocator
    return driver.cuMemHostAlloc(size, flags)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_INVALID_VALUE] Call to cuMemHostAlloc results in CUDA_ERROR_INVALID_VALUE
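
Since the free -h output above is a single snapshot, one option is to log available host memory while a job runs. The following is a minimal sketch, assuming a Linux worker whose kernel exposes MemAvailable in /proc/meminfo; the function name mem_available_mib and the log file name are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: report available host memory in MiB from a meminfo-style file.
# Assumes a Linux kernel that exposes MemAvailable in /proc/meminfo.
# mem_available_mib is a hypothetical helper name.

mem_available_mib() {
    local meminfo="${1:-/proc/meminfo}"
    # The MemAvailable line has the form "MemAvailable:  <number> kB"
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' "$meminfo"
}

# Hypothetical usage: append a timestamped reading every 10 seconds.
#   while sleep 10; do
#       printf '%s %s MiB\n' "$(date -u +%FT%TZ)" "$(mem_available_mib)"
#   done >> host_mem.log
```

Comparing such a log with the timestamp of a failure would show whether available host RAM actually dropped before the error.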

wtempel · July 15, 2024, 8:17pm

@Philipp Could you please check whether adding the line

export CRYOSPARC_NO_PAGELOCK=true

to the file

/home/cryosparcuser/cryosparc/cryosparc_worker/config.sh

resolves this issue.
What version of the Linux kernel is running on the worker computer? You can find out with the command:

uname -a
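
The config change suggested above can be made by hand in any editor. If it has to be applied on several workers, a sketch like the following appends the line only when it is missing. The function name add_config_line is made up for illustration, and the path in the usage comment is the one quoted in this thread, so it may differ on other installations.

```shell
#!/usr/bin/env bash
# Sketch: append a setting to a config file only if it is not already there.
# add_config_line is a hypothetical helper name.

add_config_line() {
    local config_file="$1" line="$2"
    # -x matches whole lines only, -F treats the line as a fixed string
    if grep -qxF -- "$line" "$config_file" 2>/dev/null; then
        echo "already present in $config_file"
        return 0
    fi
    # Make sure the file ends with a newline before appending
    if [ -s "$config_file" ] && [ -n "$(tail -c 1 "$config_file")" ]; then
        printf '\n' >> "$config_file"
    fi
    printf '%s\n' "$line" >> "$config_file"
    echo "added to $config_file"
}

# Example, run as the account that owns the CryoSPARC installation:
#   add_config_line /home/cryosparcuser/cryosparc/cryosparc_worker/config.sh \
#       'export CRYOSPARC_NO_PAGELOCK=true'
```

Running it twice leaves a single copy of the line in the file, so it is safe to repeat.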

Philipp · July 19, 2024, 9:04am

@wtempel

We’re currently monitoring whether the adjustment resolves the issue. Since the error was sporadic and not easily reproducible, we’ll need some time to confirm if everything is working smoothly again.

I’ll follow up once we are sure if this is a valid workaround or if the error persists.

Have a great weekend and thanks for the help!

Best,
Philipp

newbie · July 29, 2024, 1:59am

Hello,

I also faced the same error earlier, with a similar log. I set no_pagelock to true and ran a clone of the job with the allocated RAM increased to 3x what was needed (it had still failed with 2x), and it completed with no errors. The error hasn’t popped up again so far.

Philipp · August 1, 2024, 2:22pm

@wtempel
@newbie

After monitoring our CryoSparc instance, we found that several users were able to complete multiple jobs without any further issues. As such, I have marked your reply as the solution.

Thank you very much!

I will provide updated information regarding our kernel version once our administrator returns from his holiday.

In the meantime, is there anything else we should be aware of? This could potentially be “only” a workaround for incompatible drivers or other underlying issues.

Kind regards,

Philipp