I’m getting repeated reports of Heterogeneous Refinement jobs failing after upgrading to 4.5.1. A reboot of the worker nodes seems to resolve the issue temporarily, but it recurs after a day or two.
Here is the error message:
Traceback (most recent call last):
  File "/var/home/cryosparc_user/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2294, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 134, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 135, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 1134, in cryosparc_master.cryosparc_compute.engine.engine.process.work
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 348, in cryosparc_master.cryosparc_compute.engine.engine.EngineThread.compute_resid_pow
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 374, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.ensure_allocated
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/devices.py", line 232, in _require_cuda_context
    return fn(*args, **kws)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/api.py", line 189, in pinned_array
    buffer = current_context().memhostalloc(bytesize)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1378, in memhostalloc
    return self.memory_manager.memhostalloc(bytesize, mapped, portable, wc)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 889, in memhostalloc
    pointer = allocator()
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 884, in allocator
    return driver.cuMemHostAlloc(size, flags)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_INVALID_VALUE] Call to cuMemHostAlloc results in CUDA_ERROR_INVALID_VALUE
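In case it is useful, here is a minimal snippet (my own, not from CryoSPARC) that I believe exercises the same numba pinned-allocation path (cuda.pinned_array -> cuMemHostAlloc). Running it in the cryosparc_worker Python environment on an affected node might show whether the driver call fails even outside of a job:

import numpy as np
from numba import cuda

# cuda.pinned_array allocates page-locked host memory via cuMemHostAlloc,
# the same driver call that fails in the traceback above.
pinned = cuda.pinned_array((4096, 4096), dtype=np.float32)  # ~64 MiB
pinned[:] = 1.0

# Round-trip the buffer through the GPU to exercise the transfer path as well.
d_arr = cuda.to_device(pinned)
out = d_arr.copy_to_host()
print("pinned allocation and round-trip copy succeeded:", out.shape)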
A bit more info from master:
uname -a && free -g
Linux 5.4.0-170-generic #188-Ubuntu SMP Wed Jan 10 09:51:01 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
total used free shared buff/cache available
Mem: 93 22 0 0 70 70
Swap: 152 1 151
from worker:
env | grep PATH
CRYOSPARC_PATH=/var/home/cryosparc_user/cryosparc_worker/bin
MANPATH=:/opt/puppetlabs/puppet/share/man
PYTHONPATH=/var/home/cryosparc_user/cryosparc_worker
CRYOSPARC_CUDA_PATH=/usr/local/cuda
NUMBA_CUDA_INCLUDE_PATH=/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/include
LD_LIBRARY_PATH=
PATH=/var/home/cryosparc_user/cryosparc_worker/bin:/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/bin:/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/condabin:/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/bin:/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/opt/puppetlabs/bin:/opt/Supermicro/SuperDoctor5
/sbin/ldconfig -p | grep -i cuda
libicudata.so.66 (libc6,x86-64) => /lib/x86_64-linux-gnu/libicudata.so.66
libcudadebugger.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudadebugger.so.1
libcuda.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so.1
libcuda.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so
uname -a
Linux 5.4.0-181-generic #201-Ubuntu SMP Thu Mar 28 15:39:01 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
free -g
total used free shared buff/cache available
Mem: 250 24 199 0 26 223
Swap: 95 0 95
nvidia-smi
Fri May 24 11:40:39 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A5000 Off | 00000000:1B:00.0 Off | Off |
| 30% 47C P2 131W / 230W | 4005MiB / 24564MiB | 64% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A5000 Off | 00000000:1C:00.0 Off | Off |
| 30% 27C P8 22W / 230W | 12MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A5000 Off | 00000000:1D:00.0 Off | Off |
| 30% 28C P8 19W / 230W | 12MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A5000 Off | 00000000:1E:00.0 Off | Off |
| 30% 28C P8 19W / 230W | 12MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA RTX A5000 Off | 00000000:B2:00.0 Off | Off |
| 30% 27C P8 20W / 230W | 12MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA RTX A5000 Off | 00000000:B3:00.0 Off | Off |
| 30% 29C P8 19W / 230W | 12MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA RTX A5000 Off | 00000000:B4:00.0 Off | Off |
| 30% 28C P8 18W / 230W | 12MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA RTX A5000 Off | 00000000:B5:00.0 Off | Off |
| 30% 28C P8 20W / 230W | 12MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2681 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 9302 C python 3976MiB |
| 1 N/A N/A 2681 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 2681 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 2681 G /usr/lib/xorg/Xorg 4MiB |
| 4 N/A N/A 2681 G /usr/lib/xorg/Xorg 4MiB |
| 5 N/A N/A 2681 G /usr/lib/xorg/Xorg 4MiB |
| 6 N/A N/A 2681 G /usr/lib/xorg/Xorg 4MiB |
| 7 N/A N/A 2681 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
Thanks @abrilot for posting these details. Please can you add these lines
export NUMBA_CUDA_LOG_LEVEL="DEBUG"
export NUMBA_CUDA_LOG_API_ARGS=1
to your worker config file /var/home/cryosparc_user/cryosparc_worker/config.sh and email us the job log (job.log inside the job directory, or Metadata|Log in the GUI) when you encounter this error again. I will send you a private message with the email address.
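If it helps to capture this outside of a full job, the same driver-level API logging can, as far as I know, also be reproduced in a short standalone Python session in the cryosparc_worker environment. A sketch, assuming the variables are read when numba is imported:

import os

# Set the logging variables before numba is imported (I believe numba reads
# them at import time).
os.environ["NUMBA_CUDA_LOG_LEVEL"] = "DEBUG"
os.environ["NUMBA_CUDA_LOG_API_ARGS"] = "1"

import numpy as np
from numba import cuda

# A pinned allocation should now log the cuMemHostAlloc call and its
# arguments (size and flags), which is the information we are after.
buf = cuda.pinned_array((1024, 1024), dtype=np.float32)
print("allocated", buf.nbytes, "bytes of pinned host memory")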
Thanks @abrilot for sending us the job.log. Based on our inspection of the log:
- May we ask that you email us the corresponding job report?
- May we suggest that you define
export CRYOSPARC_NO_PAGELOCK=true
inside the file /var/home/cryosparc_user/cryosparc_worker/config.sh (guide) and see if this setting has an effect on the occurrence of CUDA_ERROR_INVALID_VALUE? A rough sketch of what this setting changes is included below.
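As I understand it, CRYOSPARC_NO_PAGELOCK makes the worker use ordinary pageable host buffers instead of page-locked (pinned) ones, so the failing cuMemHostAlloc call is avoided at the cost of somewhat slower host-device transfers. A rough illustration of the two allocation paths using numba (just a sketch, not CryoSPARC's actual code):

import numpy as np
from numba import cuda

shape = (2048, 2048)

# Page-locked (pinned) host buffer: allocated via cuMemHostAlloc, which is
# the call currently failing with CUDA_ERROR_INVALID_VALUE.
pinned_host = cuda.pinned_array(shape, dtype=np.float32)

# Ordinary pageable host buffer: plain NumPy allocation, no CUDA driver call.
pageable_host = np.empty(shape, dtype=np.float32)

# Either buffer can be copied to the device the same way; the pinned copy is
# usually faster, but the pageable copy never touches cuMemHostAlloc.
d_pinned = cuda.to_device(pinned_host)
d_pageable = cuda.to_device(pageable_host)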
I can confirm that we are still getting these issues, now on many different types of jobs (motion correction, ab initio, heterogeneous refinement, etc.).
We are additionally seeing an unresponsive UI in CryoSPARC Live (it does not display micrographs, micrograph metadata statistics, etc.). As this appears to have coincided with the update, do you have instructions for reverting to an earlier version?
@abrilot Sorry to hear you are experiencing these issues. May I ask:
- Did you email us the job report (zip file) and we somehow missed that email?
- Did the issue continue to occur even after setting CRYOSPARC_NO_PAGELOCK=true?
- What CryoSPARC version did you upgrade from?
Did you email us the job report (zip file) and we somehow missed that email?
I believe so. I don’t know if it was zipped.
Did the issue continue to occur even after setting CRYOSPARC_NO_PAGELOCK=true?
Yes
What CryoSPARC version did you upgrade from?
I think it was one of the 4.4 versions; I have asked our sysadmin to double-check.
The email may have been lost. Please can you email us the job report again? Job reports are already in zip format when downloaded through the CryoSPARC web app.
Sorry, I previously included the job log, not the job report. I have sent it again, this time including the job report.
The CryoSPARC Live problems appear to have been resolved by updating to the latest version.
Also, I am told it was 4.4.1, which is consistent with what I remember.
Thanks @abrilot. We received the report.
Thanks also for confirming this resolution.
Following up, were you able to find the cause of the original error in the thread?
In case you are referring to the cuMemHostAlloc error: please can you try appending the line
export CRYOSPARC_NO_PAGELOCK=true
to the file /var/home/cryosparc_user/cryosparc_worker/config.sh (details) and then rerunning the job. Does the cuMemHostAlloc error still occur after this adjustment?
Confirming again that the cuMemHostAlloc error still occurs after making the suggested changes.
@abrilot Thanks for trying that. Please can you email us the corresponding job report zip file?
@abrilot Thanks for emailing us the job report. Oddly, the job_log_* files inside it are empty. What is the size on disk of the job.log file inside the job directory?
About 2 GB.
I sent a job log a while ago about an identical issue as well; perhaps that one will help?
Otherwise I can find another way to get you the file.
The most recent job report corresponds to a different project, job, and job type than the job report you sent us in June. The error in the latest job report is also different, this time indicating that a GPU device ran out of RAM (cuMemAlloc failed), as opposed to the host (cuMemHostAlloc).
Either error could occur if too many jobs are running simultaneously on the host.
The cuMemAlloc error could also occur if jobs from a non-CryoSPARC application or from another CryoSPARC instance were running on the same host and the same GPU as the CryoSPARC job that failed.
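To make the distinction concrete: the two errors come from two different CUDA allocation calls. A minimal numba sketch (for illustration only, not CryoSPARC's code) that attempts each kind of allocation and catches the corresponding driver error:

import numpy as np
from numba import cuda
from numba.cuda.cudadrv.driver import CudaAPIError

try:
    # Device memory: backed by cuMemAlloc; fails when the GPU itself runs out of RAM.
    d_buf = cuda.device_array((8192, 8192), dtype=np.float32)
except CudaAPIError as e:
    print("device allocation (cuMemAlloc) failed:", e)

try:
    # Page-locked host memory: backed by cuMemHostAlloc; fails when pinned
    # host memory cannot be allocated, as in the original traceback.
    h_buf = cuda.pinned_array((8192, 8192), dtype=np.float32)
except CudaAPIError as e:
    print("pinned host allocation (cuMemHostAlloc) failed:", e)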
I have the same problem when I run 2D Classification, Ab-initio, and Heterogeneous Refinement jobs in v4.5.1. If I turn off SSD particle caching or reboot the machine it keeps working, but it is slow, or the problem returns after a few days. I am now going to update to v4.6 and hope the issue does not recur.
Setting
export CRYOSPARC_NO_PAGELOCK=true
did not solve the problem.
After cleaning out the SSD cache files under /ssd/cs_scratch/instance_worker3:39001/store-v2, it works fine for now.
Please can you post the outputs of this command for the failed jobs:
cryosparcm cli "get_job('P99', 'J199', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run')"
where you replace P99 and J199 with the relevant jobs' project and job IDs, respectively.