Error in Benchmark Job after Updating to v4.4.1

After updating cryoSPARC to v4.4.1 (cluster), I tried running the Benchmark job under Instance Testing Utilities, but it failed to finish. No other jobs were running at the time.

This message appeared 3 times in the event log:

Failed to complete GPU benchmark on GPU 0: [CUresult.CUDA_ERROR_OUT_OF_MEMORY] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

And this message appeared at the end of the log:

 [CPU:   9.06 GB] Traceback (most recent call last):
  File "/opt/cryosparc2_worker/cryosparc_compute/jobs/runcommon.py", line 2192, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 134, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 135, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/jobs/class2D/newrun.py", line 605, in cryosparc_master.cryosparc_compute.jobs.class2D.newrun.class2D_engine_run.work
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 629, in cryosparc_master.cryosparc_compute.engine.newengine.EngineThread.preprocess_image_data
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 623, in cryosparc_master.cryosparc_compute.engine.newengine.EngineThread.preprocess_image_data
  File "cryosparc_master/cryosparc_compute/engine/newgfourier.py", line 22, in cryosparc_master.cryosparc_compute.engine.newgfourier.get_plan_R2C_2D
  File "/opt/cryosparc2_worker/cryosparc_compute/skcuda_internal/fft.py", line 112, in __init__
    self.handle = gpufft.gpufft_get_plan(
RuntimeError: cuda failure (driver API): cuMemAlloc(&plan_cache.plans[idx].workspace, plan_cache.plans[idx].worksz)
-> CUDA_ERROR_OUT_OF_MEMORY  out of memory

I have not had any issues running other jobs on this version of cryoSPARC yet, but I have not thoroughly tested it. I would like to know the root cause of this and any potential solutions. Please let me know if any additional details about the system would help. Thanks!
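
(A quick first check, not part of the original post: commands along these lines, run on the worker node, would show whether GPU 0 already held memory from a stale or concurrent process when the benchmark started. The query fields assume a reasonably recent nvidia-smi.)

  nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
  nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv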

Please can you post the output of this command:

cryosparcm cli "get_job('PX', 'JY', 'instance_information')"

where you replace PX and JY with the actual project and job UIDs, respectively.
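
For example, with the project and job UIDs that appear in the output below (P68 and J1417), the command would read:

cryosparcm cli "get_job('P68', 'J1417', 'instance_information')"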

Here is the output:

{'_id': '65ba748e0a3d9a7ce2045da7',
 'instance_information': {'CUDA_version': '11.8',
                          'available_memory': '172.88GB',
                          'cpu_model': 'Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz',
                          'driver_version': '12.1',
                          'gpu_info': [{'id': 0, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}],
                          'ofd_hard_limit': 51200,
                          'ofd_soft_limit': 1024,
                          'physical_cores': 16,
                          'platform_architecture': 'x86_64',
                          'platform_node': 'gpu01',
                          'platform_release': '4.15.0-213-generic',
                          'platform_version': '#224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023',
                          'total_memory': '187.57GB',
                          'used_memory': '13.30GB'},
 'project_uid': 'P68',
 'uid': 'J1417'}
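
(Aside: the gpu_info mem value is reported in bytes; a quick conversion, for example with awk, puts GPU 0 at roughly 10.75 GiB, consistent with an 11 GB RTX 2080 Ti.)

  awk 'BEGIN { printf "%.2f GiB\n", 11546394624 / (1024 ^ 3) }'   # prints 10.75 GiB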

Thanks for the information. Were the CUDA_ERROR_OUT_OF_MEMORY messages associated with specific components of the benchmark job? Did the job complete nevertheless?

This is the portion of the Event Log where the errors appear (attached as a screenshot):

For this job, I selected a 1-GPU lane, but I saw slightly different errors when I ran it on a 4-GPU lane (that job also failed).

Thanks @ynarui for posting the screenshot. Please can you show the full text of the traceback that appears only partially at the bottom of the screenshot?

The full text of that error message is in my original post above.

Please can you email us the job reports for these two jobs?

Thanks @ynarui for emailing us the job reports. I wonder whether the cluster is configured to

  • isolate GPU resources between concurrently running jobs
  • restrict the number of processes running on a single GPU.

To find out, please can you run the commands

srun nvidia-smi
nvidia-smi # not via slurm

on host gpu01 and post their outputs.
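
(Aside, not part of the original request: a few further checks in the same spirit could help confirm how the cluster constrains GPU access. The cgroup.conf path below is a common Slurm default and may differ on this system.)

  nvidia-smi --query-gpu=index,name,compute_mode --format=csv    # EXCLUSIVE_PROCESS would limit processes per GPU
  grep -i ConstrainDevices /etc/slurm/cgroup.conf                # whether cgroups confine jobs to their allocated GPUs
  scontrol show node gpu01 | grep -i gres                        # how many GPUs Slurm advertises on the node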