I am encountering an error while performing 3D Variability analysis

I recently updated to v4.6.2 and started running into a few issues. Our IT staff told me that io_uring support is checked via a function in the liburing library, which is installed, and that the running Linux kernel has io_uring support enabled, so it is puzzling that it is not working. I am also not sure what to make of the other issue.

The first issue is the following:
[CPU: 89.8 MB Avail: 244.52 GB]
WARNING: io_uring support disabled (not supported by kernel), I/O performance may degrade

The second issue:
[CPU: 5.93 GB Avail: 239.77 GB]
Traceback (most recent call last):
File "cryosparc_master/cryosparc_compute/run.py", line 129, in cryosparc_master.cryosparc_compute.run.main
File "cryosparc_master/cryosparc_compute/jobs/var3D/run.py", line 546, in cryosparc_master.cryosparc_compute.jobs.var3D.run.run
File "cryosparc_master/cryosparc_compute/jobs/var3D/run.py", line 323, in cryosparc_master.cryosparc_compute.jobs.var3D.run.run.E_step
File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 400, in cryosparc_master.cryosparc_compute.engine.newengine.EngineThread.load_models_rspace
File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 382, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.ensure_allocated
File "/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/devices.py", line 232, in _require_cuda_context
return fn(*args, **kws)
File "/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/api.py", line 189, in pinned_array
buffer = current_context().memhostalloc(bytesize)
File "/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1378, in memhostalloc
return self.memory_manager.memhostalloc(bytesize, mapped, portable, wc)
File "/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 889, in memhostalloc
pointer = allocator()
File "/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 884, in allocator
return driver.cuMemHostAlloc(size, flags)
File "/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
return self._check_cuda_python_error(fname, libfn(*args))
File "/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_INVALID_VALUE] Call to cuMemHostAlloc results in CUDA_ERROR_INVALID_VALUE

For an explanation of why io_uring may still not be supported, please see Io_uring enabling - #10 by hsnyder.
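
If you would like to verify kernel-level io_uring support independently of CryoSPARC, here is a minimal sketch. It is not CryoSPARC's actual detection logic (which, per the report above, goes through liburing); it issues the io_uring_setup syscall directly via ctypes and assumes an x86_64 Linux system, where that syscall is number 425.

import ctypes
import os

libc = ctypes.CDLL(None, use_errno=True)
SYS_io_uring_setup = 425  # assumption: x86_64 syscall number

class IoUringParams(ctypes.Structure):
    # struct io_uring_params is 120 bytes; a zero-filled buffer is sufficient here
    _fields_ = [("raw", ctypes.c_uint8 * 120)]

params = IoUringParams()
fd = libc.syscall(SYS_io_uring_setup, 1, ctypes.byref(params))
if fd >= 0:
    os.close(fd)
    print("io_uring_setup succeeded: kernel support is available")
else:
    err = ctypes.get_errno()
    print(f"io_uring_setup failed: {os.strerror(err)} (errno {err})")

An ENOSYS error here would indicate the running kernel does not provide io_uring at all; other errors may point at seccomp or other restrictions.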

Please can you post the outputs of these commands

  1. on the CryoSPARC master
    csprojectid=P99 # replace with actual project ID
    csjobid=J199 # replace with id of the failed job
    cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'instance_information', 'status',  'params_spec', 'errors_run')"
    
  2. on the CryoSPARC worker where the job ran and failed
    uptime
    uname -a 
    nvidia-smi
    /home/cryosparc_user/cryosparc_worker/bin/cryosparcw gpulist
    

Here is the output of the get_job command on the CryoSPARC master:

{'_id': '674ce620d558853f1556fb36', 'errors_run': [{'message': '[CUresult.CUDA_ERROR_INVALID_VALUE] Call to cuMemHostAlloc results in CUDA_ERROR_INVALID_VALUE', 'warning': False}], 'instance_information': {'CUDA_version': '11.8', 'available_memory': '240.73GB', 'cpu_model': 'Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz', 'driver_version': '12.4', 'gpu_info': [{'id': 0, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti', 'pcie': '0000:3b:00'}, {'id': 1, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti', 'pcie': '0000:5e:00'}, {'id': 2, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti', 'pcie': '0000:86:00'}, {'id': 3, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti', 'pcie': '0000:d8:00'}], 'ofd_hard_limit': 262144, 'ofd_soft_limit': 1024, 'physical_cores': 24, 'platform_architecture': 'x86_64', 'platform_node': 'thelma', 'platform_release': '5.4.286-1.el8.elrepo.x86_64', 'platform_version': '#1 SMP Sun Nov 17 11:28:26 EST 2024', 'total_memory': '251.53GB', 'used_memory': '7.22GB'}, 'job_type': 'var_3D', 'params_spec': {'compute_use_ssd': {'value': False}, 'var_K': {'value': 4}, 'var_filter_res': {'value': 5}}, 'project_uid': 'P17', 'status': 'failed', 'uid': 'J59', 'version': 'v4.6.2'}

Here is the output of the commands on the worker:

16:47:17 up 6 days, 6:28, 5 users, load average: 1.06, 1.16, 1.09

Linux thelma 5.4.286-1.el8.elrepo.x86_64 #1 SMP Sun Nov 17 11:28:26 EST 2024 x86_64 x86_64 x86_64 GNU/Linux

Thu Dec 19 16:47:29 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.135                Driver Version: 550.135        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:3B:00.0  On |                  N/A |
| 31%   33C    P8             24W / 250W  |    326MiB / 11264MiB   |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:5E:00.0 Off |                  N/A |
| 31%   27C    P8              1W / 250W  |      6MiB / 11264MiB   |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:86:00.0 Off |                  N/A |
| 33%   30C    P8              1W / 250W  |      6MiB / 11264MiB   |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:D8:00.0 Off |                  N/A |
| 32%   31C    P8              8W / 250W  |      6MiB / 11264MiB   |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      6261      G   /usr/libexec/Xorg                             120MiB |
|    0   N/A  N/A      6397      G   /usr/bin/gnome-shell                           39MiB |
|    0   N/A  N/A      7820      G   /usr/lib64/firefox/firefox                    161MiB |
|    1   N/A  N/A      6261      G   /usr/libexec/Xorg                               4MiB |
|    2   N/A  N/A      6261      G   /usr/libexec/Xorg                               4MiB |
|    3   N/A  N/A      6261      G   /usr/libexec/Xorg                               4MiB |
+-----------------------------------------------------------------------------------------+

-bash: /home/cryosparc_user/cryosparc_worker/bin/cryosparcw: Permission denied

Thanks @nmillan for posting the outputs.
Please can you post the output of this command on thelma:

grep -v LICENSE_ID /home/cryosparc_user/cryosparc_worker/config.sh

If that file does not already contain a line

export CRYOSPARC_NO_PAGELOCK=true

please add that line to the file (or adjust an existing definition) and test whether the change resolves the CUDA_ERROR_INVALID_VALUE issue.
To view and, if needed, change /home/cryosparc_user/cryosparc_worker/config.sh, one may have to be logged in to the cryosparc_user Linux account.
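
For background, CRYOSPARC_NO_PAGELOCK is meant to steer the worker away from pinned (page-locked) host buffers, which are allocated through the cuMemHostAlloc call that fails in your traceback. The sketch below only illustrates the general idea; the host_alloc helper and the environment-variable dispatch are illustrative, not CryoSPARC's actual code.

import os
import numpy as np
from numba import cuda

def host_alloc(shape, dtype=np.float32):
    """Illustrative host-buffer allocator, not CryoSPARC's implementation."""
    if os.environ.get("CRYOSPARC_NO_PAGELOCK", "").lower() == "true":
        # pageable host memory: plain numpy, no CUDA driver call involved
        return np.empty(shape, dtype=dtype)
    # page-locked host memory: backed by cuMemHostAlloc, faster host/device copies
    return cuda.pinned_array(shape, dtype=dtype)

buf = host_alloc((1024, 1024))

With the variable set, the driver call that currently fails should simply never be made for these buffers, at the cost of somewhat slower host-to-device transfers.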

I have added that line and tested it but it has not yet resolved the CUDA_ERROR_INVALID_VALUE issue.

Please can you post the end of a job log for a job that failed with CUDA_ERROR_INVALID_VALUE after

export CRYOSPARC_NO_PAGELOCK=true

had been defined. You may use the command (after appropriately modifying the P and J IDs):

cryosparcm joblog P99 J199 | tail -n 40

and post its output.
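
Independently of the job log, a minimal sketch like the one below can help determine whether pinned host allocation fails outside CryoSPARC as well. It exercises the same numba.cuda.pinned_array call that appears in the traceback; the 1 GiB size and GPU index 0 are assumptions for illustration, and it would need to be run with the worker's own Python interpreter (the one whose site-packages path appears in the traceback).

import numpy as np
from numba import cuda

cuda.select_device(0)  # assumed GPU index, matching the failed job's allocation
# ~1 GiB of pinned (page-locked) host memory via the same call as the traceback
buf = cuda.pinned_array((256, 1024, 1024), dtype=np.float32)
print("pinned allocation succeeded:", buf.nbytes, "bytes")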

Here is the output:
========= sending heartbeat at 2024-12-30 19:33:24.451955
========= sending heartbeat at 2024-12-30 19:33:34.470288
========= sending heartbeat at 2024-12-30 19:33:44.489571
========= sending heartbeat at 2024-12-30 19:33:54.507268
========= sending heartbeat at 2024-12-30 19:34:04.525384
========= sending heartbeat at 2024-12-30 19:34:14.545750
/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/dispatcher.py:536: NumbaPerformanceWarning: Grid size 1 will likely result in GPU under-utilization due to low occupancy.
warn(NumbaPerformanceWarning(msg))


Transparent hugepages setting: always madvise [never]

Running job J59 of type var_3D
Running job on hostname %s localhost
Allocated Resources : {'fixed': {'SSD': False}, 'hostname': 'localhost', 'lane': 'default', 'lane_type': 'node', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3], 'GPU': [0], 'RAM': [0, 1, 2]}, 'target': {'cache_path': '/CryoSparc/cryosparc_scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 2, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 3, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}], 'hostname': 'localhost', 'lane': 'default', 'monitor_port': None, 'name': 'localhost', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparc_user@localhost', 'title': 'Worker node localhost', 'type': 'node', 'worker_bin_path': '/home/cryosparc_user/cryosparc_worker/bin/cryosparcw'}}
HOST ALLOCATION FUNCTION: using numba.cuda.pinned_array
**** handle exception rc
Traceback (most recent call last):
File "cryosparc_master/cryosparc_compute/run.py", line 129, in cryosparc_master.cryosparc_compute.run.main
File "cryosparc_master/cryosparc_compute/jobs/var3D/run.py", line 546, in cryosparc_master.cryosparc_compute.jobs.var3D.run.run
File "cryosparc_master/cryosparc_compute/jobs/var3D/run.py", line 323, in cryosparc_master.cryosparc_compute.jobs.var3D.run.run.E_step
File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 400, in cryosparc_master.cryosparc_compute.engine.newengine.EngineThread.load_models_rspace
File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 382, in cryosparc_master.cryosparc_compute.gpu.gpucore.EngineBaseThread.ensure_allocated
File "/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/devices.py", line 232, in _require_cuda_context
return fn(*args, **kws)
File "/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/api.py", line 189, in pinned_array
buffer = current_context().memhostalloc(bytesize)
File "/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 1378, in memhostalloc
return self.memory_manager.memhostalloc(bytesize, mapped, portable, wc)
File "/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 889, in memhostalloc
pointer = allocator()
File "/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 884, in allocator
return driver.cuMemHostAlloc(size, flags)
File "/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
return self._check_cuda_python_error(fname, libfn(*args))
File "/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_INVALID_VALUE] Call to cuMemHostAlloc results in CUDA_ERROR_INVALID_VALUE
set status to failed
========= main process now complete at 2024-12-30 19:34:24.567087.

@nmillan Please can you post the output of the command

grep -v LICENSE /home/cryosparc_user/cryosparc_worker/config.sh

Here is the output from the command:
export CRYOSPARC_USE_GPU=true

@nmillan It seems that the definition

export CRYOSPARC_NO_PAGELOCK=true

has not been added to

/home/cryosparc_user/cryosparc_worker/config.sh

Please can you add that definition to the file and check whether you can run a 3DVA job after the change.

Hello, I manually added the definition to the config.sh file.

I then executed the file with ./config.sh and ran the grep -v LICENSE /home/cryosparc_user/cryosparc_worker/config.sh command, which gave the following output:

export CRYOSPARC_USE_GPU=true
export CRYOSPARC_CUDA_PATH="/usr/local/cuda"
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_NO_PAGELOCK=true

Then I tried running the 3DVA job but I am still getting the same error.

Did you run that job after Dec 30, 2024? If so, please can you post the output of the command

cryosparcm joblog P99 J199 | tail -n 50

(after having replaced P99, J199 with the project and job IDs of the latest failed 3DVA job).
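
It may also be worth confirming that the definition in config.sh actually reaches worker processes: the earlier job log still shows HOST ALLOCATION FUNCTION: using numba.cuda.pinned_array, i.e. pinned allocation was still selected for that run. A minimal sketch, assuming it is executed inside the worker environment (for example with the worker's own Python interpreter seen in the traceback paths, after config.sh has been sourced):

import os
# prints None if the variable from config.sh never reaches the worker environment
print("CRYOSPARC_NO_PAGELOCK =", os.environ.get("CRYOSPARC_NO_PAGELOCK"))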

Hello, I am sorry for my delayed response. Our IT specialist recently updated the kernel to fix the io_uring issue, and while setting CryoSPARC up again to make sure it was working well, we also resolved the CUDA problem I was facing here. Thank you so much for your help along the way!
