I’m getting repeated reports of Heterogeneous Refinement jobs failing after upgrading to 4.5.1. A reboot of the worker nodes seems to resolve the issue temporarily, but it recurs after a day or two.
Here is the error message:
Traceback (most recent call last):
  File "/var/home/cryosparc_user/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2294, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 134, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 135, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 1134, in cryosparc_master.cryosparc_compute.engine.engine.process.work
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 348, in cryosparc_master.cryosparc_compute.engine.engine.EngineThread.compute_resid_pow
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 374, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.ensure_allocated
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/devices.py", line 232, in _require_cuda_context
    return fn(*args, **kws)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/api.py", line 189, in pinned_array
    buffer = current_context().memhostalloc(bytesize)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1378, in memhostalloc
    return self.memory_manager.memhostalloc(bytesize, mapped, portable, wc)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 889, in memhostalloc
    pointer = allocator()
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 884, in allocator
    return driver.cuMemHostAlloc(size, flags)
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_INVALID_VALUE] Call to cuMemHostAlloc results in CUDA_ERROR_INVALID_VALUE
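In case it is useful, here is a minimal snippet (my own, not from CryoSPARC) that I believe exercises the same numba pinned-allocation path (cuda.pinned_array -> cuMemHostAlloc). Running it in the cryosparc_worker Python environment on an affected node might show whether the driver call fails even outside of a job:

import numpy as np
from numba import cuda

# cuda.pinned_array allocates page-locked host memory via cuMemHostAlloc,
# the same driver call that fails in the traceback above.
pinned = cuda.pinned_array((4096, 4096), dtype=np.float32)  # ~64 MiB
pinned[:] = 1.0

# Round-trip the buffer through the GPU to exercise the transfer path as well.
d_arr = cuda.to_device(pinned)
out = d_arr.copy_to_host()
print("pinned allocation and round-trip copy succeeded:", out.shape)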
A bit more info from master:
uname -a && free -g
Linux 5.4.0-170-generic #188-Ubuntu SMP Wed Jan 10 09:51:01 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
total used free shared buff/cache available
Mem: 93 22 0 0 70 70
Swap: 152 1 151
from worker:
env | grep PATH
CRYOSPARC_PATH=/var/home/cryosparc_user/cryosparc_worker/bin
MANPATH=:/opt/puppetlabs/puppet/share/man
PYTHONPATH=/var/home/cryosparc_user/cryosparc_worker
CRYOSPARC_CUDA_PATH=/usr/local/cuda
NUMBA_CUDA_INCLUDE_PATH=/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/include
LD_LIBRARY_PATH=
PATH=/var/home/cryosparc_user/cryosparc_worker/bin:/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/bin:/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/condabin:/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/bin:/var/home/cryosparc_user/cryosparc_worker/deps/anaconda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/opt/puppetlabs/bin:/opt/Supermicro/SuperDoctor5
/sbin/ldconfig -p | grep -i cuda
libicudata.so.66 (libc6,x86-64) => /lib/x86_64-linux-gnu/libicudata.so.66
libcudadebugger.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudadebugger.so.1
libcuda.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so.1
libcuda.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so
uname -a
Linux 5.4.0-181-generic #201-Ubuntu SMP Thu Mar 28 15:39:01 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
free -g
total used free shared buff/cache available
Mem: 250 24 199 0 26 223
Swap: 95 0 95
nvidia-smi
Fri May 24 11:40:39 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A5000 Off | 00000000:1B:00.0 Off | Off |
| 30% 47C P2 131W / 230W | 4005MiB / 24564MiB | 64% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A5000 Off | 00000000:1C:00.0 Off | Off |
| 30% 27C P8 22W / 230W | 12MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A5000 Off | 00000000:1D:00.0 Off | Off |
| 30% 28C P8 19W / 230W | 12MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A5000 Off | 00000000:1E:00.0 Off | Off |
| 30% 28C P8 19W / 230W | 12MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA RTX A5000 Off | 00000000:B2:00.0 Off | Off |
| 30% 27C P8 20W / 230W | 12MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA RTX A5000 Off | 00000000:B3:00.0 Off | Off |
| 30% 29C P8 19W / 230W | 12MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA RTX A5000 Off | 00000000:B4:00.0 Off | Off |
| 30% 28C P8 18W / 230W | 12MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA RTX A5000 Off | 00000000:B5:00.0 Off | Off |
| 30% 28C P8 20W / 230W | 12MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2681 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 9302 C python 3976MiB |
| 1 N/A N/A 2681 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 2681 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 2681 G /usr/lib/xorg/Xorg 4MiB |
| 4 N/A N/A 2681 G /usr/lib/xorg/Xorg 4MiB |
| 5 N/A N/A 2681 G /usr/lib/xorg/Xorg 4MiB |
| 6 N/A N/A 2681 G /usr/lib/xorg/Xorg 4MiB |
| 7 N/A N/A 2681 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
Thanks @abrilot for posting these details. Please can you add these lines
export NUMBA_CUDA_LOG_LEVEL="DEBUG"
export NUMBA_CUDA_LOG_API_ARGS=1
to your worker config file /var/home/cryosparc_user/cryosparc_worker/config.sh and email us the job log (job.log inside the job directory, or Metadata|Log in the GUI) when you encounter this error again. I will send you a private message with the email address.
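If it helps to capture this outside of a full job, the same driver-level API logging can, as far as I know, also be reproduced in a short standalone Python session in the cryosparc_worker environment. A sketch, assuming the variables are read when numba is imported:

import os

# Set the logging variables before numba is imported (I believe numba reads
# them at import time).
os.environ["NUMBA_CUDA_LOG_LEVEL"] = "DEBUG"
os.environ["NUMBA_CUDA_LOG_API_ARGS"] = "1"

import numpy as np
from numba import cuda

# A pinned allocation should now log the cuMemHostAlloc call and its
# arguments (size and flags), which is the information we are after.
buf = cuda.pinned_array((1024, 1024), dtype=np.float32)
print("allocated", buf.nbytes, "bytes of pinned host memory")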
Thanks @abrilot for sending us the job.log. Based on our inspection of the log:
- May we ask that you email us the corresponding job report?
- May we suggest that you define
export CRYOSPARC_NO_PAGELOCK=true
inside the file /var/home/cryosparc_user/cryosparc_worker/config.sh (guide) and see if this setting has an effect on the occurrence of CUDA_ERROR_INVALID_VALUE? A rough sketch of what this setting changes is included below.
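As I understand it, CRYOSPARC_NO_PAGELOCK makes the worker use ordinary pageable host buffers instead of page-locked (pinned) ones, so the failing cuMemHostAlloc call is avoided at the cost of somewhat slower host-device transfers. A rough illustration of the two allocation paths using numba (just a sketch, not CryoSPARC's actual code):

import numpy as np
from numba import cuda

shape = (2048, 2048)

# Page-locked (pinned) host buffer: allocated via cuMemHostAlloc, which is
# the call currently failing with CUDA_ERROR_INVALID_VALUE.
pinned_host = cuda.pinned_array(shape, dtype=np.float32)

# Ordinary pageable host buffer: plain NumPy allocation, no CUDA driver call.
pageable_host = np.empty(shape, dtype=np.float32)

# Either buffer can be copied to the device the same way; the pinned copy is
# usually faster, but the pageable copy never touches cuMemHostAlloc.
d_pinned = cuda.to_device(pinned_host)
d_pageable = cuda.to_device(pageable_host)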
I can confirm that we are still getting these issues, now on many different types of jobs (motion correction, ab initio, heterogeneous refinement, etc.).
We are additionally seeing an unresponsive UI in CryoSPARC Live (it does not display micrographs, micrograph metadata statistics, etc.). As this appears to have coincided with the update, do you have instructions for reverting to an earlier version?
@abrilot Sorry to hear you are experiencing these issues. May I ask:
- Did you email us the job report (zip file) and we somehow missed that email?
- Did the issue continue to occur even after setting CRYOSPARC_NO_PAGELOCK=true?
- What CryoSPARC version did you upgrade from?
Did you email us the job report (zip file) and we somehow missed that email?
I believe so. I don’t know if it was zipped.
Did the issue continue to occur even after setting CRYOSPARC_NO_PAGELOCK=true?
Yes
What CryoSPARC version did you upgrade from?
I think it was one of the 4.4 versions; I have asked our sysadmin to double-check.
The email may have been lost. Please can you email us the job report again? Job reports are already in zip format when downloaded through the CryoSPARC web app.
Sorry, I previously included the job log, not the job report. I have sent it again, this time including the job report.
The CryoSPARC Live problems appear to have been resolved by updating to the latest version.
Also, I am told it was 4.4.1, which is consistent with what I remember.
Thanks @abrilot. We received the report.
Thanks also for confirming this resolution.
Following up, were you able to find the cause of the original error in the thread?
In case you are referring to the cuMemHostAlloc error: please can you try appending the line
export CRYOSPARC_NO_PAGELOCK=true
to the file /var/home/cryosparc_user/cryosparc_worker/config.sh (details) and then rerunning the job. Does the cuMemHostAlloc error still occur after this adjustment?
Confirming again that the cuMemHostAlloc error still occurs after making the suggested changes.
@abrilot Thanks for trying that. Please can you email us the corresponding job report zip file?
@abrilot Thanks for emailing us the job report. Oddly, the job_log_* files inside it are empty. What is the size on disk of the job.log file inside the job directory?
About 2 GB.
I sent a job log a while ago about an identical issue as well; perhaps that one will help?
Otherwise I can find another way to get you the file.
The most recent job report corresponds to a different project, job, and job type than the job report you sent us in June. The error in the latest job report is also different, this time indicating that a GPU device ran out of RAM (cuMemAlloc failed), as opposed to the host (cuMemHostAlloc).
Either error could occur if too many jobs are running simultaneously on the host.
The cuMemAlloc error could also occur if jobs from a non-CryoSPARC application or from another CryoSPARC instance were running on the same host and the same GPU as the CryoSPARC job that failed.
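To make the distinction concrete: the two errors come from two different CUDA allocation calls. A minimal numba sketch (for illustration only, not CryoSPARC's code) that attempts each kind of allocation and catches the corresponding driver error:

import numpy as np
from numba import cuda
from numba.cuda.cudadrv.driver import CudaAPIError

try:
    # Device memory: backed by cuMemAlloc; fails when the GPU itself runs out of RAM.
    d_buf = cuda.device_array((8192, 8192), dtype=np.float32)
except CudaAPIError as e:
    print("device allocation (cuMemAlloc) failed:", e)

try:
    # Page-locked host memory: backed by cuMemHostAlloc; fails when pinned
    # host memory cannot be allocated, as in the original traceback.
    h_buf = cuda.pinned_array((8192, 8192), dtype=np.float32)
except CudaAPIError as e:
    print("pinned host allocation (cuMemHostAlloc) failed:", e)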
I have the same problem when I run 2D Classification, Ab-initio, and Heterogeneous Refinement jobs in v4.5.1. If I turn off SSD particle caching or reboot the machine it keeps working, but it is slow, or the problem returns after a few days. I am now going to update to v4.6 and hope the issue does not recur.
Setting
export CRYOSPARC_NO_PAGELOCK=true
did not solve the problem.
After cleaning out the SSD cache files under /ssd/cs_scratch/instance_worker3:39001/store-v2, it works fine for now.
Please can you post the outputs of this command for the failed jobs:
cryosparcm cli "get_job('P99', 'J199', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run')"
where you replace P99 and J199 with the relevant jobs' project and job IDs, respectively.