Memory leak in CryoSPARC Live?

We have been observing CUDA out-of-memory errors in CryoSPARC Live for some time. Jobs start and process several dozen micrographs, then begin producing out-of-memory errors, first intermittently, then reliably.

The same jobs run outside of the GUI produce no out-of-memory errors. GPU and system RAM usage are less than half of the total available for the first few micrographs (as observed in nvidia-smi and htop, respectively). GPU RAM usage then climbs slowly over time until the errors begin.
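
To make the climb easy to track, something like the following sampler can log per-GPU memory use over time (a rough sketch; it assumes nvidia-smi is on the PATH, and the 30 s interval and log file name are arbitrary placeholders):

# Minimal sketch: append a timestamped memory reading for every GPU to a CSV.
# Interval and output path are arbitrary; stop it with Ctrl-C.
import datetime
import subprocess
import time

LOG = "gpu_mem_log.csv"  # hypothetical output path

with open(LOG, "a") as f:
    while True:
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=index,memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        ts = datetime.datetime.now().isoformat(timespec="seconds")
        for line in out.strip().splitlines():
            f.write(f"{ts},{line}\n")
        f.flush()
        time.sleep(30)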

This error happens even immediately after the system has been rebooted, and when no other users are using the server.

We are using CryoSPARC v4.5.3.

Error message:

Traceback (most recent call last):
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 851, in _attempt_allocation
    return allocator()
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1054, in allocator
    return driver.cuMemAlloc(size)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 344, in verbose_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_OUT_OF_MEMORY] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/jobs/rtp_workers/run.py", line 381, in cryosparc_master.cryosparc_compute.jobs.rtp_workers.run.rtp_worker
  File "cryosparc_master/cryosparc_compute/jobs/rtp_workers/run.py", line 450, in cryosparc_master.cryosparc_compute.jobs.rtp_workers.run.process_movie
  File "cryosparc_master/cryosparc_compute/jobs/rtp_workers/run.py", line 596, in cryosparc_master.cryosparc_compute.jobs.rtp_workers.run.do_patch_motion
  File "cryosparc_master/cryosparc_compute/jobs/rtp_workers/run.py", line 625, in cryosparc_master.cryosparc_compute.jobs.rtp_workers.run.do_patch_motion
  File "cryosparc_master/cryosparc_compute/jobs/rtp_workers/run.py", line 602, in cryosparc_master.cryosparc_compute.jobs.rtp_workers.run.do_patch_motion
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 292, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 394, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 390, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.ensure_allocated
  File "/var/home/cryosparc_user/cryosparc_worker/cryosparc_compute/gpu/gpuarray.py", line 270, in empty
    return device_array(shape, dtype, stream=stream)
  File "/var/home/cryosparc_user/cryosparc_worker/cryosparc_compute/gpu/gpuarray.py", line 226, in device_array
    arr = GPUArray(shape=shape, strides=strides, dtype=dtype, stream=stream)
  File "/var/home/cryosparc_user/cryosparc_worker/cryosparc_compute/gpu/gpuarray.py", line 21, in __init__
    super().__init__(shape, strides, dtype, stream, gpu_data)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/devicearray.py", line 103, in __init__
    gpu_data = devices.get_context().memalloc(self.alloc_size)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1372, in memalloc
    return self.memory_manager.memalloc(bytesize)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1056, in memalloc
    ptr = self._attempt_allocation(allocator)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 863, in _attempt_allocation
    return allocator()
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1054, in allocator
    return driver.cuMemAlloc(size)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 344, in verbose_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_OUT_OF_MEMORY] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

@abrilot Thanks for posting your observation.
Please can you post:

  1. the output of the command (on the CryoSPARC master)
    cryosparcm cli "get_scheduler_targets()"
    
  2. the name of the preprocessing scheduler lane
  3. output of the command (on the relevant CryoSPARC worker)
    nvidia-smi
    
  4. the movie format

cryosparcm cli "get_scheduler_targets()"

gives:

[{'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 1, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 2, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 3, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 4, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 5, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 6, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 7, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}], 'hostname': 'hydra.biosci.utexas.edu', 'lane': 'Hydra', 'monitor_port': None, 'name': 'hydra.biosci.utexas.edu', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95], 'GPU': [0, 1, 2, 3, 4, 5, 6, 7], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparc_user@hydra.biosci.utexas.edu', 'title': 'Worker node hydra.biosci.utexas.edu', 'type': 'node', 'worker_bin_path': '/var/home/cryosparc_user/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 1, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 2, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 3, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 4, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 5, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 6, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 7, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}], 'hostname': 'athena.biosci.utexas.edu', 'lane': 'Athena', 'monitor_port': None, 'name': 'athena.biosci.utexas.edu', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], 'GPU': [0, 1, 2, 3, 4, 5, 6, 7], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparc_user@athena.biosci.utexas.edu', 'title': 'Worker node athena.biosci.utexas.edu', 'type': 'node', 'worker_bin_path': '/var/home/cryosparc_user/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 1, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 2, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 3, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 4, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 5, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 6, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 7, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}], 'hostname': 'poseidon.biosci.utexas.edu', 'lane': 'Poseidon', 'monitor_port': None, 'name': 'poseidon.biosci.utexas.edu', 'resource_fixed': {'SSD': True}, 'resource_slots': 
{'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], 'GPU': [0, 1, 2, 3, 4, 5, 6, 7], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparc_user@poseidon.biosci.utexas.edu', 'title': 'Worker node poseidon.biosci.utexas.edu', 'type': 'node', 'worker_bin_path': '/var/home/cryosparc_user/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 1, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 2, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 3, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}], 'hostname': 'javelina.biosci.utexas.edu', 'lane': 'Javelina', 'monitor_port': None, 'name': 'javelina.biosci.utexas.edu', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, 'ssh_str': 'cryosparc_user@javelina.biosci.utexas.edu', 'title': 'Worker node javelina.biosci.utexas.edu', 'type': 'node', 'worker_bin_path': '/data1/local/home/cryosparc_user/cryosparc2_worker/bin/cryosparcw'}, {'cache_path': '/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 1, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 2, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 3, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 4, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 5, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 6, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 7, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}], 'hostname': 'roadrunner.biosci.utexas.edu', 'lane': 'Roadrunner', 'monitor_port': None, 'name': 'roadrunner.biosci.utexas.edu', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], 'GPU': [0, 1, 2, 3, 4, 5, 6, 7], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, 'ssh_str': 'cryosparc_user@roadrunner.biosci.utexas.edu', 'title': 'Worker node roadrunner.biosci.utexas.edu', 'type': 'node', 'worker_bin_path': '/local/home/cryosparc_user/cryosparc2_worker/bin/cryosparcw'}]
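
For reference, the dump above is a Python literal, so it can be summarised directly; a quick sketch (assuming the output is saved verbatim to a file whose name below is a placeholder) shows each node's lane, GPU count, and roughly 23.7 GiB per A5000:

# Quick summary of the get_scheduler_targets() output pasted above.
# Assumes it was saved verbatim to targets.txt (placeholder name).
import ast

with open("targets.txt") as f:
    targets = ast.literal_eval(f.read())

for t in targets:
    gpus = t["gpus"]
    print(f"{t['hostname']:32s} lane={t['lane']:12s} "
          f"gpus={len(gpus)} mem_per_gpu={gpus[0]['mem'] / 2**30:.1f} GiB")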

This has occurred on all targets that have A5000s: athena, poseidon, and hydra. I most recently reproduced the error on poseidon.

nvidia-smi gives:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A5000               Off | 00000000:1B:00.0 Off |                  Off |
| 30%   29C    P8              20W / 230W |      9MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000               Off | 00000000:1C:00.0 Off |                  Off |
| 30%   28C    P8              19W / 230W |      9MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A5000               Off | 00000000:1D:00.0 Off |                  Off |
| 30%   29C    P8              17W / 230W |      9MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000               Off | 00000000:1E:00.0 Off |                  Off |
| 30%   29C    P8              18W / 230W |      9MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA RTX A5000               Off | 00000000:B2:00.0 Off |                  Off |
| 30%   30C    P8              23W / 230W |      9MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA RTX A5000               Off | 00000000:B3:00.0 Off |                  Off |
| 30%   30C    P8              21W / 230W |      9MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA RTX A5000               Off | 00000000:B4:00.0 Off |                  Off |
| 30%   31C    P8              27W / 230W |      9MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA RTX A5000               Off | 00000000:B5:00.0 Off |                  Off |
| 30%   30C    P8              16W / 230W |      9MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      5470      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      5470      G   /usr/lib/xorg/Xorg                            4MiB |
|    2   N/A  N/A      5470      G   /usr/lib/xorg/Xorg                            4MiB |
|    3   N/A  N/A      5470      G   /usr/lib/xorg/Xorg                            4MiB |
|    4   N/A  N/A      5470      G   /usr/lib/xorg/Xorg                            4MiB |
|    5   N/A  N/A      5470      G   /usr/lib/xorg/Xorg                            4MiB |
|    6   N/A  N/A      5470      G   /usr/lib/xorg/Xorg                            4MiB |
|    7   N/A  N/A      5470      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

The movie format is EER.

What are the

  1. movie dimensions (stored pixel counts in x and y)
  2. EER upsampling factor setting in CryoSPARC (if not default 2)
  3. EER number of fractions setting in CryoSPARC (if not default 40)

Are the CryoSPARC workers also used

  1. as workers for another CryoSPARC instance
  2. for non-CryoSPARC GPU compute tasks

Could such tasks be simultaneously running on the same GPU as a job from this CryoSPARC instance?

Movie dimensions? Isn't this standard for an .eer from a Falcon 4? I imagine it would be 4k x 4k.

The upsampling factor is 2.

The number of fractions is 60.
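
For context, a rough back-of-the-envelope for those settings (assuming a 4096 x 4096 Falcon 4 sensor and float32 frames; CryoSPARC's actual working buffers for motion correction will come on top of this):

# Back-of-the-envelope: rendered EER movie size at upsampling factor 2.
# Assumes a 4096 x 4096 sensor and float32 frames (an assumption, not
# CryoSPARC's documented allocation pattern).
nx = ny = 4096 * 2          # upsampling 2 -> 8192 x 8192 pixels
n_fractions = 60
bytes_per_px = 4            # float32

frame_gib = nx * ny * bytes_per_px / 2**30
movie_gib = frame_gib * n_fractions
print(f"one frame : {frame_gib:.2f} GiB")   # 0.25 GiB
print(f"full movie: {movie_gib:.1f} GiB")   # ~15 GiB against ~23.7 GiB per A5000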

These CryoSPARC workers are only used by a single CryoSPARC instance.

There were no other tasks running simultaneously. I was watching the server like a hawk, rebooted it to make sure no jobs were left over, and ran the work on a weekend. It is theoretically possible for someone to log in, but that is very rare on these servers, and I checked to make sure it did not happen here. This is also fairly consistent behavior: over time, all the GPUs in a multi-GPU job accumulate these errors, which is not consistent with any usage pattern I have ever seen from our users.

In short, I’m 100% sure this behavior arises from CryoSPARC Live, not from any other processes running on these GPUs.

I should add that after pausing the CryoSPARC Live jobs, GPU RAM usage immediately dropped back to ~12 MiB, indicating that the problem was not caused by another job but was correlated with the running Live jobs.

Is there some other information I can provide to help troubleshoot?

Thanks again for reporting. We’ve been looking into the issue but don’t yet have a clear understanding of the cause. Some additional questions:

  1. How many preprocessing workers were running in the session?
  2. Was the session also running streaming 2D classification and/or streaming refinement at the time it failed?
  3. Were all the jobs in the Live session running on the same node?
  4. How exactly were the outside-of-GUI jobs run (“The same jobs run outside of the GUI produce no out-of-memory errors.”)?

Answers below:

How many preprocessing workers were running in the session? I’ve had the same errors using 1–8 workers.

Was the session also running streaming 2D classification and/or streaming refinement at the time it failed?

We generally only run streaming to 2D, not to refinement. I’m pretty sure it has failed before with only preprocessing running.

Were all the jobs in the Live session running on the same node? I’ve had this happen both with them all on the same node and spread across nodes. I’m pretty sure the preprocessing jobs alone are causing the problems, as the GPUs running preprocessing are the ones where GPU RAM usage climbs steadily.

How exactly were the outside-of-GUI jobs run (“The same jobs run outside of the GUI produce no out-of-memory errors.”)?

Import movies, then run jobs in sequence (motion correction, CTF estimation, picking, 2D) with identical parameters and data, up through 2D classification.
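
For anyone wanting to script that same non-Live sequence, a rough sketch with cryosparc-tools might look like the following; the credentials, project/workspace UIDs, paths, and lane name are placeholders, and the job-type and parameter names are from memory and may differ between CryoSPARC versions:

# Rough sketch: queue the first two steps of the non-Live chain with
# cryosparc-tools. All identifiers below are placeholders.
from cryosparc.tools import CryoSPARC

cs = CryoSPARC(license="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
               host="master.example.edu", base_port=39000,
               email="user@example.edu", password="...")
project = cs.find_project("P1")

imp = project.create_job("W1", "import_movies",
                         params={"blob_paths": "/data/session/*.eer"})  # placeholder path
imp.queue("Poseidon")
imp.wait_for_done()

mc = project.create_job("W1", "patch_motion_correction_multi",
                        connections={"movies": (imp.uid, "imported_movies")})
mc.queue("Poseidon")
mc.wait_for_done()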

Another hint as to the cause: our Krios, where we collect data as TIFFs on a K3, recently came back up, and CryoSPARC Live ran with no errors. Perhaps something about how CryoSPARC Live handles EER files is causing this issue?

@abrilot, we have attempted to reproduce this with similar inputs (EER Falcon 4 data, upsampling 2, 60 fractions) but were unable to reproduce it.

Do you have any other types of GPUs that you could try this on?

Have you gotten this error with any other data sets?