Heterogeneous Refinement fails in v4.5.1

I haven’t had reports of this error in a while, but here’s the output from a job I was able to find that exhibited it:

cryosparcm cli "get_job('P825', 'J46', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run')"
{'_id': '6658eb9a68790124e2702a6a', 'errors_run': [{'message': '[CUresult.CUDA_ERROR_INVALID_VALUE] Call to cuMemHostAlloc results in CUDA_ERROR_INVALID_VALUE', 'warning': False}], 'instance_information': {'CUDA_version': '11.8', 'available_memory': '218.91GB', 'cpu_model': 'Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz', 'driver_version': '12.2', 'gpu_info': [{'id': 0, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:1a:00'}, {'id': 1, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:1b:00'}, {'id': 2, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:3d:00'}, {'id': 3, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:3e:00'}, {'id': 4, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:88:00'}, {'id': 5, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:89:00'}, {'id': 6, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:b1:00'}, {'id': 7, 'mem': 25425608704, 'name': 'NVIDIA RTX A5000', 'pcie': '0000:b2:00'}], 'ofd_hard_limit': 1048576, 'ofd_soft_limit': 1024, 'physical_cores': 48, 'platform_architecture': 'x86_64', 'platform_node': 'hydra', 'platform_release': '5.4.0-181-generic', 'platform_version': '#201-Ubuntu SMP Thu Mar 28 15:39:01 UTC 2024', 'total_memory': '251.54GB', 'used_memory': '30.19GB'}, 'job_type': 'class_2D_new', 'params_spec': {'class2D_K': {'value': 100}, 'compute_num_gpus': {'value': 1}, 'compute_use_ssd': {'value': False}}, 'project_uid': 'P825', 'status': 'failed', 'uid': 'J46', 'version': 'v4.5.1'}
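
(For readability, the Python-repr dict that cryosparcm cli prints can be pretty-printed with standard library tools; a minimal sketch, where raw is a short stand-in for the full captured output:)

import ast, pprint
# Paste the full single-quoted dict printed by `cryosparcm cli` into raw;
# a short stand-in is used here so the sketch runs as-is.
raw = "{'uid': 'J46', 'status': 'failed', 'version': 'v4.5.1'}"
pprint.pprint(ast.literal_eval(raw))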

@abrilot Please can you post the output of the command

cryosparcm joblog P825 J46 | tail -n 40

== CUDA [822814] INFO -- add pending dealloc: cuMemFree 39200000 bytes
== CUDA [822814] INFO -- dealloc: cuMemFree 39200000 bytes
== CUDA [822814] DEBUG -- call driver api: cuMemFree(<CUdeviceptr 140045562413056>)
== CUDA [822814] INFO -- add pending dealloc: cuMemFreeHost ? bytes
== CUDA [822814] INFO -- dealloc: cuMemFreeHost ? bytes
== CUDA [822814] DEBUG -- call driver api: cuMemFreeHost(140045461749760)
Traceback (most recent call last):
== CUDA [822820] INFO -- add pending dealloc: cuMemFree 32400000 bytes
  File "/var/home/cryosparc_user/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2294, in run_with_except_hook
== CUDA [822820] INFO -- dealloc: cuMemFree 32400000 bytes
== CUDA [822820] DEBUG -- call driver api: cuMemFree(<CUdeviceptr 140045226868736>)
== CUDA [822823] DEBUG -- call driver api: cuCtxGetCurrent()
== CUDA [822823] DEBUG -- call driver api: cuCtxGetDevice()
== CUDA [822823] DEBUG -- call driver api: cuCtxPopCurrent()
set status to failed
**custom thread exception hook caught something
**** handle exception rc
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 134, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 135, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/jobs/class2D/newrun.py", line 639, in cryosparc_master.cryosparc_compute.jobs.class2D.newrun.class2D_engine_run.work
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 1383, in cryosparc_master.cryosparc_compute.engine.newengine.EngineThread.compute_resid_pow
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 374, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.ensure_allocated
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/devices.py", line 232, in _require_cuda_context
    return fn(*args, **kws)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/api.py", line 189, in pinned_array
    buffer = current_context().memhostalloc(bytesize)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1378, in memhostalloc
    return self.memory_manager.memhostalloc(bytesize, mapped, portable, wc)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 889, in memhostalloc
    pointer = allocator()
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 884, in allocator
    return driver.cuMemHostAlloc(size, flags)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 344, in verbose_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_INVALID_VALUE] Call to cuMemHostAlloc results in CUDA_ERROR_INVALID_VALUE
========= main process now complete at 2024-05-30 16:50:25.091160.
========= monitor process now complete at 2024-05-30 16:50:25.096414.
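
For reference, the deepest frames above are numba’s cuda.pinned_array, which page-locks host memory through the cuMemHostAlloc driver call. The same allocation path can be exercised outside CryoSPARC with a minimal sketch (assuming the worker’s Python environment; the byte count is taken from the joblog above):

import numpy as np
from numba import cuda

# Request a page-locked (pinned) host buffer, as the traceback shows
# CryoSPARC doing via pinned_array; raises CudaAPIError if cuMemHostAlloc fails.
nbytes = 39200000  # an allocation size seen in the joblog above
arr = cuda.pinned_array(nbytes, dtype=np.uint8)
print("pinned allocation of", arr.nbytes, "bytes succeeded")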

Thanks, @abrilot.
Is a cluster workload manager, such as GridEngine or Slurm, used for job submission?
Were other jobs (not necessarily CryoSPARC) running on node hydra when job J46 failed?

No SGE or Slurm. We almost never have other users run jobs on our CryoSPARC servers, and this used to happen often enough that I doubt it was due to other jobs (though I haven’t seen users report this specific issue in some time).

@abrilot Please can you post the outputs of these commands on hydra:

uname -a
grep -v LICENSE_ID /var/home/cryosparc_user/cryosparc_worker/config.sh

Linux hydra 5.4.0-193-generic #213-Ubuntu SMP Fri Aug 2 19:14:16 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
cryosparc_user@hydra:~$ grep -v LICENSE_ID /var/home/cryosparc_user/cryosparc_worker/config.sh

export CRYOSPARC_USE_GPU=true
export CRYOSPARC_CUDA_PATH="/usr/local/cuda"
export CRYOSPARC_DEVELOP=false
export NUMBA_CUDA_LOG_LEVEL="DEBUG"
export NUMBA_CUDA_LOG_API_ARGS=1
export CRYOSPARC_NO_PAGELOCK=true
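
The NUMBA_CUDA_LOG_* variables above are standard numba settings and explain the verbose "== CUDA ..." lines in the joblog. They can also be used outside CryoSPARC to watch the driver calls around a pinned allocation; a minimal sketch (the variables must be set before numba is imported, so exporting them in the shell beforehand works equally well):

import os
# Enable numba's driver-call logging, mirroring the worker config above.
os.environ["NUMBA_CUDA_LOG_LEVEL"] = "DEBUG"
os.environ["NUMBA_CUDA_LOG_API_ARGS"] = "1"
import numpy as np
from numba import cuda
cuda.pinned_array(1024, dtype=np.uint8)  # the cuMemHostAlloc driver call is logged to stderr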

@abrilot Please can you run these commands on the hydra computer:

cd $(mktemp -d)
pwd # make a note so you can find the directory later
tar zcvf worker_config.tgz /var/home/cryosparc_user/cryosparc_worker/config.sh

and send us the worker_config.tgz file. I will let you know the email address via direct messaging.
We would like to examine the file for non-printing characters that may affect the CRYOSPARC_NO_PAGELOCK setting.
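
(Such a check can also be run locally; a minimal sketch in Python that flags any byte outside printable ASCII, tabs included:)

# Flag non-printing bytes in the worker config, line by line.
path = "/var/home/cryosparc_user/cryosparc_worker/config.sh"
with open(path, "rb") as f:
    for lineno, line in enumerate(f, start=1):
        suspect = [hex(b) for b in line.rstrip(b"\n") if b < 0x20 or b > 0x7e]
        if suspect:
            print(f"line {lineno}: non-printing bytes {suspect}")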

To be clear, this issue seems resolved for us, so I don’t think there are any problems in our config file; I was just posting from files I still had from when we hit this failure.

Let me know if you still want the config file.

Glad to hear that, @abrilot. How did you resolve the issue?

@abrilot How did you solve this issue? I am having the same problem in v4.7.1: I get exactly the same error during ab-initio runs.

1. It only happens with ab-initio jobs that have many classes. So far, jobs with up to 40 classes have worked, but jobs with 60 or 80 classes crash at variable points (sometimes after 1 h, sometimes after 6 h).
2. It does not seem to be tied to a lack of memory: the error has occurred when we were using only 200 particles, yet sometimes a job runs all the way to the 100k-particle mark before crashing.

I am operating the workstation remotely and copying files to the SSD, so as far as I understand it does not seem to be related to communication issues.

@joana.paulino Does your cryosparc_worker/config.sh file already contain the line

export CRYOSPARC_NO_PAGELOCK=true

?

Yes, we do have it; I just double-checked.

Please can you post the output of the command (replacing P99 and J199 with the actual IDs of the failed ab initio job):

cryosparcm joblog P99 J199 | grep "HOST ALLOCATION FUNCTION"

I ran the command for all three jobs that failed; the output was the same for each of them:
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
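
For context, that joblog line indicates the pageable-memory fallback is active: with CRYOSPARC_NO_PAGELOCK=true, host buffers come from numpy.empty instead of numba’s page-locked cuda.pinned_array, so cuMemHostAlloc is not called for them. A hedged sketch of the switch the log line implies (CryoSPARC’s internals are not public; the function name here is illustrative):

import os
import numpy as n
from numba import cuda

def host_alloc(shape, dtype):
    # Illustrative only, not CryoSPARC's actual code: choose pageable memory
    # (n.empty) when CRYOSPARC_NO_PAGELOCK is set, page-locked memory otherwise.
    if os.environ.get("CRYOSPARC_NO_PAGELOCK", "").lower() == "true":
        return n.empty(shape, dtype=dtype)        # pageable; avoids cuMemHostAlloc
    return cuda.pinned_array(shape, dtype=dtype)  # page-locked via cuMemHostAlloc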