Memory leak in CryoSPARC Live?

We have been observing CUDA out-of-memory errors in CryoSPARC Live for some time. Jobs start and process several dozen micrographs, then begin producing out-of-memory errors, first intermittently, then reliably.

The same jobs run outside of the GUI produce no out-of-memory errors. GPU and system RAM usage are less than half of the total available for the first few micrographs (as observed in nvidia-smi and htop, respectively). GPU RAM usage then climbs slowly over time until the errors begin.
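
To make the climb easy to track, something like the following sampler can log per-GPU memory use over time (a rough sketch; it assumes nvidia-smi is on the PATH, and the 30 s interval and log file name are arbitrary placeholders):

# Minimal sketch: append a timestamped memory reading for every GPU to a CSV.
# Interval and output path are arbitrary; stop it with Ctrl-C.
import datetime
import subprocess
import time

LOG = "gpu_mem_log.csv"  # hypothetical output path

with open(LOG, "a") as f:
    while True:
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=index,memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        ts = datetime.datetime.now().isoformat(timespec="seconds")
        for line in out.strip().splitlines():
            f.write(f"{ts},{line}\n")
        f.flush()
        time.sleep(30)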

This error happens even immediately after the system has been rebooted, and when no other users are using the server.

We are using CryoSPARC v4.5.3.

Error message:

Traceback (most recent call last):
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 851, in _attempt_allocation
    return allocator()
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1054, in allocator
    return driver.cuMemAlloc(size)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 344, in verbose_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_OUT_OF_MEMORY] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/jobs/rtp_workers/run.py", line 381, in cryosparc_master.cryosparc_compute.jobs.rtp_workers.run.rtp_worker
  File "cryosparc_master/cryosparc_compute/jobs/rtp_workers/run.py", line 450, in cryosparc_master.cryosparc_compute.jobs.rtp_workers.run.process_movie
  File "cryosparc_master/cryosparc_compute/jobs/rtp_workers/run.py", line 596, in cryosparc_master.cryosparc_compute.jobs.rtp_workers.run.do_patch_motion
  File "cryosparc_master/cryosparc_compute/jobs/rtp_workers/run.py", line 625, in cryosparc_master.cryosparc_compute.jobs.rtp_workers.run.do_patch_motion
  File "cryosparc_master/cryosparc_compute/jobs/rtp_workers/run.py", line 602, in cryosparc_master.cryosparc_compute.jobs.rtp_workers.run.do_patch_motion
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 292, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 394, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 390, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.ensure_allocated
  File "/var/home/cryosparc_user/cryosparc_worker/cryosparc_compute/gpu/gpuarray.py", line 270, in empty
    return device_array(shape, dtype, stream=stream)
  File "/var/home/cryosparc_user/cryosparc_worker/cryosparc_compute/gpu/gpuarray.py", line 226, in device_array
    arr = GPUArray(shape=shape, strides=strides, dtype=dtype, stream=stream)
  File "/var/home/cryosparc_user/cryosparc_worker/cryosparc_compute/gpu/gpuarray.py", line 21, in __init__
    super().__init__(shape, strides, dtype, stream, gpu_data)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/devicearray.py", line 103, in __init__
    gpu_data = devices.get_context().memalloc(self.alloc_size)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1372, in memalloc
    return self.memory_manager.memalloc(bytesize)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1056, in memalloc
    ptr = self._attempt_allocation(allocator)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 863, in _attempt_allocation
    return allocator()
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1054, in allocator
    return driver.cuMemAlloc(size)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 344, in verbose_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_OUT_OF_MEMORY] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

@abrilot Thanks for posting your observation.
Please can you post:

  1. the output of the command (on the CryoSPARC master)
    cryosparcm cli "get_scheduler_targets()"
    
  2. the name of the preprocessing scheduler lane
  3. output of the command (on the relevant CryoSPARC worker)
    nvidia-smi
    
  4. the movie format

cryosparcm cli "get_scheduler_targets()"

gives:

[{'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 1, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 2, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 3, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 4, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 5, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 6, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 7, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}], 'hostname': 'hydra.biosci.utexas.edu', 'lane': 'Hydra', 'monitor_port': None, 'name': 'hydra.biosci.utexas.edu', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95], 'GPU': [0, 1, 2, 3, 4, 5, 6, 7], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparc_user@hydra.biosci.utexas.edu', 'title': 'Worker node hydra.biosci.utexas.edu', 'type': 'node', 'worker_bin_path': '/var/home/cryosparc_user/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 1, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 2, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 3, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 4, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 5, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 6, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 7, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}], 'hostname': 'athena.biosci.utexas.edu', 'lane': 'Athena', 'monitor_port': None, 'name': 'athena.biosci.utexas.edu', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], 'GPU': [0, 1, 2, 3, 4, 5, 6, 7], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparc_user@athena.biosci.utexas.edu', 'title': 'Worker node athena.biosci.utexas.edu', 'type': 'node', 'worker_bin_path': '/var/home/cryosparc_user/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 1, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 2, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 3, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 4, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 5, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 6, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 7, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}], 'hostname': 'poseidon.biosci.utexas.edu', 'lane': 'Poseidon', 'monitor_port': None, 'name': 'poseidon.biosci.utexas.edu', 'resource_fixed': {'SSD': True}, 'resource_slots': 
{'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], 'GPU': [0, 1, 2, 3, 4, 5, 6, 7], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparc_user@poseidon.biosci.utexas.edu', 'title': 'Worker node poseidon.biosci.utexas.edu', 'type': 'node', 'worker_bin_path': '/var/home/cryosparc_user/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 1, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 2, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 3, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}], 'hostname': 'javelina.biosci.utexas.edu', 'lane': 'Javelina', 'monitor_port': None, 'name': 'javelina.biosci.utexas.edu', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, 'ssh_str': 'cryosparc_user@javelina.biosci.utexas.edu', 'title': 'Worker node javelina.biosci.utexas.edu', 'type': 'node', 'worker_bin_path': '/data1/local/home/cryosparc_user/cryosparc2_worker/bin/cryosparcw'}, {'cache_path': '/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 1, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 2, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 3, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 4, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 5, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 6, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}, {'id': 7, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000'}], 'hostname': 'roadrunner.biosci.utexas.edu', 'lane': 'Roadrunner', 'monitor_port': None, 'name': 'roadrunner.biosci.utexas.edu', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], 'GPU': [0, 1, 2, 3, 4, 5, 6, 7], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, 'ssh_str': 'cryosparc_user@roadrunner.biosci.utexas.edu', 'title': 'Worker node roadrunner.biosci.utexas.edu', 'type': 'node', 'worker_bin_path': '/local/home/cryosparc_user/cryosparc2_worker/bin/cryosparcw'}]
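
For reference, the dump above is a Python literal, so it can be summarised directly; a quick sketch (assuming the output is saved verbatim to a file whose name below is a placeholder) shows each node's lane, GPU count, and roughly 23.7 GiB per A5000:

# Quick summary of the get_scheduler_targets() output pasted above.
# Assumes it was saved verbatim to targets.txt (placeholder name).
import ast

with open("targets.txt") as f:
    targets = ast.literal_eval(f.read())

for t in targets:
    gpus = t["gpus"]
    print(f"{t['hostname']:32s} lane={t['lane']:12s} "
          f"gpus={len(gpus)} mem_per_gpu={gpus[0]['mem'] / 2**30:.1f} GiB")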

This has occurred on all targets that have A5000s: athena, poseidon, and hydra. I most recently reproduced the error on poseidon.

nvidia-smi gives:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A5000               Off | 00000000:1B:00.0 Off |                  Off |
| 30%   29C    P8              20W / 230W |      9MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000               Off | 00000000:1C:00.0 Off |                  Off |
| 30%   28C    P8              19W / 230W |      9MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A5000               Off | 00000000:1D:00.0 Off |                  Off |
| 30%   29C    P8              17W / 230W |      9MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000               Off | 00000000:1E:00.0 Off |                  Off |
| 30%   29C    P8              18W / 230W |      9MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA RTX A5000               Off | 00000000:B2:00.0 Off |                  Off |
| 30%   30C    P8              23W / 230W |      9MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA RTX A5000               Off | 00000000:B3:00.0 Off |                  Off |
| 30%   30C    P8              21W / 230W |      9MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA RTX A5000               Off | 00000000:B4:00.0 Off |                  Off |
| 30%   31C    P8              27W / 230W |      9MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA RTX A5000               Off | 00000000:B5:00.0 Off |                  Off |
| 30%   30C    P8              16W / 230W |      9MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      5470      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      5470      G   /usr/lib/xorg/Xorg                            4MiB |
|    2   N/A  N/A      5470      G   /usr/lib/xorg/Xorg                            4MiB |
|    3   N/A  N/A      5470      G   /usr/lib/xorg/Xorg                            4MiB |
|    4   N/A  N/A      5470      G   /usr/lib/xorg/Xorg                            4MiB |
|    5   N/A  N/A      5470      G   /usr/lib/xorg/Xorg                            4MiB |
|    6   N/A  N/A      5470      G   /usr/lib/xorg/Xorg                            4MiB |
|    7   N/A  N/A      5470      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

The movie format is EER.

What are the

  1. movie dimensions (stored pixel counts in x and y)
  2. EER upsampling factor setting in CryoSPARC (if not default 2)
  3. EER number of fractions setting in CryoSPARC (if not default 40)

Are the CryoSPARC workers also used

  1. as workers for another CryoSPARC instance
  2. for non-CryoSPARC GPU compute tasks

Could such tasks be simultaneously running on the same GPU as a job from this CryoSPARC instance?

Movie dimensions? Isn't this standard for an .eer from a Falcon 4? I imagine it would be 4k x 4k.

The upsampling factor is 2.

The number of fractions is 60.
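
For context, a rough back-of-the-envelope for those settings (assuming a 4096 x 4096 Falcon 4 sensor and float32 frames; CryoSPARC's actual working buffers for motion correction will come on top of this):

# Back-of-the-envelope: rendered EER movie size at upsampling factor 2.
# Assumes a 4096 x 4096 sensor and float32 frames (an assumption, not
# CryoSPARC's documented allocation pattern).
nx = ny = 4096 * 2          # upsampling 2 -> 8192 x 8192 pixels
n_fractions = 60
bytes_per_px = 4            # float32

frame_gib = nx * ny * bytes_per_px / 2**30
movie_gib = frame_gib * n_fractions
print(f"one frame : {frame_gib:.2f} GiB")   # 0.25 GiB
print(f"full movie: {movie_gib:.1f} GiB")   # ~15 GiB against ~23.7 GiB per A5000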

These CryoSPARC workers are only used by a single CryoSPARC instance.

There were no other tasks running simultaneously. I was watching the server like a hawk, rebooted it to make sure no jobs were left over, and ran the work on a weekend. It is theoretically possible for someone to log in, but that is very rare on these servers, and I checked to make sure it did not happen here. This is also fairly consistent behavior: over time, all the GPUs in a multi-GPU job accumulate these errors, which is not consistent with any usage pattern I have ever seen from our users.

In short, I’m 100% sure this behavior arises from CryoSPARC Live, not from any other processes running on these GPUs.

I should add that after pausing the CryoSPARC Live jobs, GPU RAM usage immediately dropped back to ~12 MiB, indicating that the problem was not caused by another job but was correlated with the running Live jobs.

Is there some other information I can provide to help troubleshoot?

Thanks again for reporting. We’ve been looking into the issue but don’t yet have a clear understanding of the cause. Some additional questions:

  1. How many preprocessing workers were running in the session?
  2. Was the session also running streaming 2D classification and/or streaming refinement at the time it failed?
  3. Were all the jobs in the Live session running on the same node?
  4. How exactly were the outside-of-GUI jobs run (“The same jobs run outside of the GUI produce no out-of-memory errors.”)?

Answers below:

How many preprocessing workers were running in the session? I’ve had the same errors using 1–8 workers.

Was the session also running streaming 2D classification and/or streaming refinement at the time it failed?

We generally only run streaming to 2D, not to refinement. I’m pretty sure it has failed before with only preprocessing running.

Were all the jobs in the Live session running on the same node? I’ve had this happen both with them all on the same node and spread across nodes. I’m pretty sure the preprocessing jobs alone are causing the problems, as the GPUs running preprocessing are the ones where GPU RAM usage climbs steadily.

How exactly were the outside-of-GUI jobs run (“The same jobs run outside of the GUI produce no out-of-memory errors.”)?

Import movies, then run jobs in sequence (motion correction, CTF estimation, picking, 2D) with identical parameters and data, up through 2D classification.
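
For anyone wanting to script that same non-Live sequence, a rough sketch with cryosparc-tools might look like the following; the credentials, project/workspace UIDs, paths, and lane name are placeholders, and the job-type and parameter names are from memory and may differ between CryoSPARC versions:

# Rough sketch: queue the first two steps of the non-Live chain with
# cryosparc-tools. All identifiers below are placeholders.
from cryosparc.tools import CryoSPARC

cs = CryoSPARC(license="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
               host="master.example.edu", base_port=39000,
               email="user@example.edu", password="...")
project = cs.find_project("P1")

imp = project.create_job("W1", "import_movies",
                         params={"blob_paths": "/data/session/*.eer"})  # placeholder path
imp.queue("Poseidon")
imp.wait_for_done()

mc = project.create_job("W1", "patch_motion_correction_multi",
                        connections={"movies": (imp.uid, "imported_movies")})
mc.queue("Poseidon")
mc.wait_for_done()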

Another hint as to the cause: our Krios, where we collect data as TIFFs on a K3, recently came back up, and CryoSPARC Live ran with no errors. Perhaps something about how CryoSPARC Live handles EER files is causing this issue?

@abrilot, we have attempted to reproduce this with similar inputs (EER Falcon 4 data, upsampling 2, 60 fractions) but were unable to reproduce it.

Do you have any other types of GPUs that you could try this on?

Have you gotten this error with any other data sets?