Heterogeneous Refinement fails in v4.5.1

I haven’t had reports of this error in a while, but here’s the output from a job I was able to find that exhibited it:

cryosparcm cli "get_job('P825', 'J46', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run')"
{'_id': '6658eb9a68790124e2702a6a', 'errors_run': [{'message': '[CUresult.CUDA_ERROR_INVALID_VALUE] Call to cuMemHostAlloc results in CUDA_ERROR_INVALID_VALUE', 'warning': False}], 'instance_information': {'CUDA_version': '11.8', 'available_memory': '218.91GB', 'cpu_model': 'Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz', 'driver_version': '12.2', 'gpu_info': [{'id': 0, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:1a:00'}, {'id': 1, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:1b:00'}, {'id': 2, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:3d:00'}, {'id': 3, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:3e:00'}, {'id': 4, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:88:00'}, {'id': 5, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:89:00'}, {'id': 6, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:b1:00'}, {'id': 7, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:b2:00'}], 'ofd_hard_limit': 1048576, 'ofd_soft_limit': 1024, 'physical_cores': 48, 'platform_architecture': 'x86_64', 'platform_node': 'hydra', 'platform_release': '5.4.0-181-generic', 'platform_version': '#201-Ubuntu SMP Thu Mar 28 15:39:01 UTC 2024', 'total_memory': '251.54GB', 'used_memory': '30.19GB'}, 'job_type': 'class_2D_new', 'params_spec': {'class2D_K': {'value': 100}, 'compute_num_gpus': {'value': 1}, 'compute_use_ssd': {'value': False}}, 'project_uid': 'P825', 'status': 'failed', 'uid': 'J46', 'version': 'v4.5.1'}
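
(For readability, the Python-repr dict that cryosparcm cli prints can be pretty-printed with standard library tools; a minimal sketch, where raw is a short stand-in for the full captured output:)

import ast, pprint
# Paste the full single-quoted dict printed by `cryosparcm cli` into raw;
# a short stand-in is used here so the sketch runs as-is.
raw = "{'uid': 'J46', 'status': 'failed', 'version': 'v4.5.1'}"
pprint.pprint(ast.literal_eval(raw))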

@abrilot Please can you post the output of the command

cryosparcm joblog P825 J46 | tail -n 40

== CUDA [822814] INFO -- add pending dealloc: cuMemFree 39200000 bytes
== CUDA [822814] INFO -- dealloc: cuMemFree 39200000 bytes
== CUDA [822814] DEBUG -- call driver api: cuMemFree(<CUdeviceptr 140045562413056>)
== CUDA [822814] INFO -- add pending dealloc: cuMemFreeHost ? bytes
== CUDA [822814] INFO -- dealloc: cuMemFreeHost ? bytes
== CUDA [822814] DEBUG -- call driver api: cuMemFreeHost(140045461749760)
Traceback (most recent call last):
== CUDA [822820] INFO -- add pending dealloc: cuMemFree 32400000 bytes
  File "/var/home/cryosparc_user/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2294, in run_with_except_hook
== CUDA [822820] INFO -- dealloc: cuMemFree 32400000 bytes
== CUDA [822820] DEBUG -- call driver api: cuMemFree(<CUdeviceptr 140045226868736>)
== CUDA [822823] DEBUG -- call driver api: cuCtxGetCurrent()
== CUDA [822823] DEBUG -- call driver api: cuCtxGetDevice()
== CUDA [822823] DEBUG -- call driver api: cuCtxPopCurrent()
set status to failed
**custom thread exception hook caught something
**** handle exception rc
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 134, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 135, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/jobs/class2D/newrun.py", line 639, in cryosparc_master.cryosparc_compute.jobs.class2D.newrun.class2D_engine_run.work
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 1383, in cryosparc_master.cryosparc_compute.engine.newengine.EngineThread.compute_resid_pow
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 374, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.ensure_allocated
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/devices.py", line 232, in _require_cuda_context
    return fn(*args, **kws)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/api.py", line 189, in pinned_array
    buffer = current_context().memhostalloc(bytesize)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1378, in memhostalloc
    return self.memory_manager.memhostalloc(bytesize, mapped, portable, wc)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 889, in memhostalloc
    pointer = allocator()
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 884, in allocator
    return driver.cuMemHostAlloc(size, flags)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 344, in verbose_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_INVALID_VALUE] Call to cuMemHostAlloc results in CUDA_ERROR_INVALID_VALUE
========= main process now complete at 2024-05-30 16:50:25.091160.
========= monitor process now complete at 2024-05-30 16:50:25.096414.
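
For reference, the deepest frames above are numba’s cuda.pinned_array, which page-locks host memory through the cuMemHostAlloc driver call. The same allocation path can be exercised outside CryoSPARC with a minimal sketch (assuming the worker’s Python environment; the byte count is taken from the joblog above):

import numpy as np
from numba import cuda

# Request a page-locked (pinned) host buffer, as the traceback shows
# CryoSPARC doing via pinned_array; raises CudaAPIError if cuMemHostAlloc fails.
nbytes = 39200000  # an allocation size seen in the joblog above
arr = cuda.pinned_array(nbytes, dtype=np.uint8)
print("pinned allocation of", arr.nbytes, "bytes succeeded")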

Thanks, @abrilot.
Is a cluster workload manager, such as GridEngine or Slurm, used for job submission?
Were other jobs (not necessarily CryoSPARC) running on node hydra when job J46 failed?

No SGE or Slurm. We almost never have other users run jobs on our CryoSPARC servers, and this used to happen often enough that I doubt it was due to other jobs (though I haven’t seen users report this specific issue in some time).

@abrilot Please can you post the outputs of these commands on hydra:

uname -a
grep -v LICENSE_ID /var/home/cryosparc_user/cryosparc_worker/config.sh

Linux hydra 5.4.0-193-generic #213-Ubuntu SMP Fri Aug 2 19:14:16 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
cryosparc_user@hydra:~$ grep -v LICENSE_ID /var/home/cryosparc_user/cryosparc_worker/config.sh

export CRYOSPARC_USE_GPU=true
export CRYOSPARC_CUDA_PATH="/usr/local/cuda"
export CRYOSPARC_DEVELOP=false
export NUMBA_CUDA_LOG_LEVEL="DEBUG"
export NUMBA_CUDA_LOG_API_ARGS=1
export CRYOSPARC_NO_PAGELOCK=true
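
The NUMBA_CUDA_LOG_* variables above are standard numba settings and explain the verbose "== CUDA ..." lines in the joblog. They can also be used outside CryoSPARC to watch the driver calls around a pinned allocation; a minimal sketch (the variables must be set before numba is imported, so exporting them in the shell beforehand works equally well):

import os
# Enable numba's driver-call logging, mirroring the worker config above.
os.environ["NUMBA_CUDA_LOG_LEVEL"] = "DEBUG"
os.environ["NUMBA_CUDA_LOG_API_ARGS"] = "1"
import numpy as np
from numba import cuda
cuda.pinned_array(1024, dtype=np.uint8)  # the cuMemHostAlloc driver call is logged to stderr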

@abrilot Please can you run these commands on the hydra computer:

cd $(mktemp -d)
pwd # make a note so you can find the directory later
tar zcvf worker_config.tgz /var/home/cryosparc_user/cryosparc_worker/config.sh

and send us the worker_config.tgz file. I will let you know the email address via direct messaging.
We would like to examine the file for non-printing characters that may affect the CRYOSPARC_NO_PAGELOCK setting.
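
(Such a check can also be run locally; a minimal sketch in Python that flags any byte outside printable ASCII, tabs included:)

# Flag non-printing bytes in the worker config, line by line.
path = "/var/home/cryosparc_user/cryosparc_worker/config.sh"
with open(path, "rb") as f:
    for lineno, line in enumerate(f, start=1):
        suspect = [hex(b) for b in line.rstrip(b"\n") if b < 0x20 or b > 0x7e]
        if suspect:
            print(f"line {lineno}: non-printing bytes {suspect}")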

To be clear, this issue seems resolved for us, so I don’t think there are any problems in our config file; I was just posting from files I still had from when we hit this failure.

Let me know if you still want the config file.

Glad to hear that, @abrilot. How did you resolve the issue?

@abrilot How did you solve this issue? I am having the same problem in v4.7.1: I get exactly the same error during ab-initio runs.

1. It only happens with ab-initio jobs that have many classes. So far, jobs with up to 40 classes have worked, but jobs with 60 or 80 classes crash at variable points (sometimes after 1 h, sometimes after 6 h).
2. It does not seem to be tied to a lack of memory: the error has occurred when we were using only 200 particles, yet sometimes a job runs all the way to the 100k-particle mark before crashing.

I am operating the workstation remotely and copying files to the SSD, so as far as I understand it does not seem to be related to communication issues.

@joana.paulino Does your cryosparc_worker/config.sh file already contain the line

export CRYOSPARC_NO_PAGELOCK=true

?

Yes, we do have it; I just double-checked.

Please can you post the output of the command (replacing P99 and J199 with the actual IDs of the failed ab initio job):

cryosparcm joblog P99 J199 | grep "HOST ALLOCATION FUNCTION"

I ran the command for all three jobs that failed; the output was the same for each of them:
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
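
For context, that joblog line indicates the pageable-memory fallback is active: with CRYOSPARC_NO_PAGELOCK=true, host buffers come from numpy.empty instead of numba’s page-locked cuda.pinned_array, so cuMemHostAlloc is not called for them. A hedged sketch of the switch the log line implies (CryoSPARC’s internals are not public; the function name here is illustrative):

import os
import numpy as n
from numba import cuda

def host_alloc(shape, dtype):
    # Illustrative only, not CryoSPARC's actual code: choose pageable memory
    # (n.empty) when CRYOSPARC_NO_PAGELOCK is set, page-locked memory otherwise.
    if os.environ.get("CRYOSPARC_NO_PAGELOCK", "").lower() == "true":
        return n.empty(shape, dtype=dtype)        # pageable; avoids cuMemHostAlloc
    return cuda.pinned_array(shape, dtype=dtype)  # page-locked via cuMemHostAlloc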