pycuda._driver.MemoryError: cuMemHostAlloc failed: out of memory in v3.0

crescalante · December 16, 2020, 4:00pm

I updated to version 3.0 and ran 2D classification on ~3M particles. The following error appeared:
[CPU: 7.43 GB] Traceback (most recent call last):
  File "/home/xxx/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1711, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 129, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 130, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1066, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 499, in cryosparc_compute.engine.engine.EngineThread.cull_candidates
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 312, in cryosparc_compute.engine.cuda_core.EngineBaseThread.ensure_allocated
pycuda._driver.MemoryError: cuMemHostAlloc failed: out of memory

The box size was 300, and the job was running on a system with 4 GTX 1080 Ti GPUs. I ran it again with ~1M particles and a box size of 256 and got the same error. I had previously run this data on cryoSPARC v2.5 using ~1M particles, and it worked fine. Any ideas what causes the error? Thanks.

jamon · December 17, 2020, 2:13am

I ran into the same problem: a previous dataset caused a CUDA error in 2D classification. I just upgraded to 3.0.1 today, and it seems to be working so far. I wonder whether it may be due to some CUDA compatibility issue.

crescalante · December 17, 2020, 2:44am

I will update to 3.0.1. I have CUDA 10.2 and it was OK in v2.5. Thanks.

crescalante · December 17, 2020, 4:00am

Updating to 3.0.1 solved the problem. Thanks.

crescalante · December 17, 2020, 3:18pm

Actually, there are still some issues. I had a particle set with around 7M particles and split it into four. Three of the subsets worked fine, but one still failed with “cuMemHostAlloc failed: out of memory”, even though all four contain the same number of particles.

MHB · January 16, 2021, 5:05am

I am seeing a similar issue in heterogeneous refinement in 3.0.1:

Traceback (most recent call last):
  File "/home/cryosparc_user/software/cryosparc2_worker/cryosparc_compute/jobs/runcommon.py", line 1722, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 129, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 130, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1066, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 499, in cryosparc_compute.engine.engine.EngineThread.cull_candidates
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 312, in cryosparc_compute.engine.cuda_core.EngineBaseThread.ensure_allocated
pycuda._driver.MemoryError: cuMemHostAlloc failed: out of memory

MHB · January 16, 2021, 5:23am

This also occurs when running prior data that completed without error in V2.15

MHB · January 18, 2021, 8:02pm

This appears to be related to cryoSPARC mis-allocating memory. In our case, if we stop cryoSPARC, reboot the system, and then restart cryoSPARC, everything is fine. I am not sure how to troubleshoot this problem or why it happens.
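
For anyone comparing a machine before and after such a reboot, a snapshot of host memory can help show whether RAM was actually short when the job failed. cuMemHostAlloc requests page-locked host memory, so the failing allocation is on the host side, not in GPU memory. The following is only a minimal sketch: it assumes a Linux host that exposes /proc/meminfo, and the helper names parse_meminfo and summarize are hypothetical, not part of cryoSPARC.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value}.

    Most fields are reported in kiB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def summarize(info):
    """Return (total, free, available) host memory in GiB, one decimal place."""
    def to_gib(kib):
        return round(kib / (1024 ** 2), 1)

    return tuple(to_gib(info.get(key, 0))
                 for key in ("MemTotal", "MemFree", "MemAvailable"))
```

Reading /proc/meminfo and passing its contents through parse_meminfo and summarize before starting a job, and again right after a failure, gives two tuples to compare. Whether a low value actually explains the error in this thread remains an open question.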

spunjani · February 8, 2021, 7:33pm

@MHB, can you please let us know which OS you are running, as well as which GPUs and your NVIDIA driver version?

crescalante · February 8, 2021, 8:13pm

The same happened to me running a heterogeneous refinement job on v3.1. I stopped cryoSPARC and rebooted the computer; after that, the job ran fine. The reboot seems to be required, as simply stopping and starting cryoSPARC did not work.

stephan · February 8, 2021, 9:08pm

Hi @crescalante,

Can you run the following command and paste the output here:
lscpu && free -g && uname -a
If you have sudo, please run the following command as well:
sudo dmidecode --type memory
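
If it is easier to gather everything in one go, the commands above could be wrapped in a small script. This is a sketch rather than an official cryoSPARC tool: the names COMMANDS, collect, and format_report are made up for illustration, and sudo dmidecode --type memory is left out because it needs root.

```python
import shlex
import subprocess

# The unprivileged commands requested in this thread.
COMMANDS = ["lscpu", "free -g", "uname -a"]


def collect(commands):
    """Run each command and return {command: stdout, or an error message}."""
    results = {}
    for cmd in commands:
        try:
            proc = subprocess.run(shlex.split(cmd), capture_output=True,
                                  text=True, timeout=30)
            # Keep stderr instead of stdout when the command reports failure.
            results[cmd] = proc.stdout if proc.returncode == 0 else proc.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            results[cmd] = f"could not run: {exc}"
    return results


def format_report(results):
    """Join the outputs into one block that is easy to paste into a post."""
    return "\n".join(f"$ {cmd}\n{out.rstrip()}\n" for cmd, out in results.items())
```

Calling print(format_report(collect(COMMANDS))) on the worker node prints each command followed by its output, ready to paste.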

Paul · February 9, 2021, 11:27pm

I’m having the same issue, along with a host of others. I was on cryoSPARC 2.6.1, and it was upgraded after a cryosparcm stop/start and a reboot of the system, but I get this error:

Traceback (most recent call last):
  File "cryosparc2_worker/cryosparc2_compute/run.py", line 72, in cryosparc2_compute.run.main
  File "cryosparc2_compute/jobs/jobregister.py", line 337, in get_run_function
    runmod = importlib.import_module("…"+modname, name)
  File "/opt/packages/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "cryosparc2_worker/cryosparc2_compute/jobs/rtp_workers/run.py", line 20, in init cryosparc2_compute.jobs.rtp_workers.run
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/motioncorrection.py", line 8, in init cryosparc2_compute.jobs.motioncorrection.motioncorrection
  File "cryosparc2_compute/engine/__init__.py", line 8, in <module>
    from engine import *
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 17, in init cryosparc2_compute.engine.engine
  File "cryosparc2_compute/fourier.py", line 22, in <module>
    from numba import autojit
ImportError: cannot import name autojit

So I reinstalled cryoSPARC v3 along with the NVIDIA 440.82 driver and the CUDA 10.2 toolkit. The main error I now get when running a Live session is the following:

Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/jobs/rtp_workers/run.py", line 356, in cryosparc_compute.jobs.rtp_workers.run.rtp_worker
  File "cryosparc_worker/cryosparc_compute/jobs/rtp_workers/run.py", line 417, in cryosparc_compute.jobs.rtp_workers.run.process_movie
  File "cryosparc_worker/cryosparc_compute/jobs/rtp_workers/run.py", line 561, in cryosparc_compute.jobs.rtp_workers.run.do_patch_motion
  File "cryosparc_worker/cryosparc_compute/jobs/rtp_workers/run.py", line 566, in cryosparc_compute.jobs.rtp_workers.run.do_patch_motion
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 251, in cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 371, in cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 339, in cryosparc_compute.engine.cuda_core.EngineBaseThread.ensure_allocated
  File "/opt/packages/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/gpuarray.py", line 210, in __init__
    self.gpudata = self.allocator(self.size * self.dtype.itemsize)
pycuda._driver.MemoryError: cuMemAlloc failed: out of memory

I also ran the commands requested earlier in this thread.

lscpu && free -g && uname -a
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz
Stepping: 4
CPU MHz: 800.008
BogoMIPS: 4200.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 16896K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d
              total        used        free      shared  buff/cache   available
Mem:            282          36         134           0         111         244
Swap:             1           0           1
Linux 5.4.0-65-generic #73~18.04.1-Ubuntu SMP Tue Jan 19 09:02:24 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

dmidecode --type memory

dmidecode 3.1

Getting SMBIOS data from sysfs.
SMBIOS 3.0.0 present.

Handle 0x1000, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 3 TB
Error Information Handle: Not Provided
Number Of Devices: 24

Handle 0x1100, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x1000
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: 1
Locator: A1
Bank Locator: Not Specified
Type: DDR4
Type Detail: Synchronous Registered (Buffered)
Speed: 2666 MT/s
Manufacturer: 00AD00B300AD
Serial Number: 520C8E4F
Asset Tag: 01173851
Part Number: HMA82GR7AFR8N-VK
Rank: 2
Configured Clock Speed: 2400 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V

spunjani · February 10, 2021, 5:00pm

@Paul can you please let us know which GPUs you are using? You may need to turn on “low memory mode” in cryoSPARC Live

Paul · February 10, 2021, 7:26pm

Here is the nvidia-smi output of a typical worker node.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P5000        Off  | 00000000:3B:00.0 Off |                  Off |
| 22%   37C    P0    42W / 180W |      0MiB / 16278MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro P5000        Off  | 00000000:D8:00.0 Off |                  Off |
| 22%   35C    P0    42W / 180W |      0MiB / 16278MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here is our larger GPU server while running a live session that generates the error.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 450.66       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 6000     Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   31C    P0    56W / 250W |    908MiB / 22698MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 6000     Off  | 00000000:1B:00.0 Off |                    0 |
| N/A   24C    P8    13W / 250W |      8MiB / 22698MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 6000     Off  | 00000000:3D:00.0 Off |                    0 |
| N/A   24C    P8    13W / 250W |      8MiB / 22698MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 6000     Off  | 00000000:3E:00.0 Off |                    0 |
| N/A   25C    P8    13W / 250W |      8MiB / 22698MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Quadro RTX 6000     Off  | 00000000:8B:00.0 Off |                    0 |
| N/A   24C    P8    12W / 250W |      8MiB / 22698MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Quadro RTX 6000     Off  | 00000000:8C:00.0 Off |                    0 |
| N/A   26C    P8    13W / 250W |      8MiB / 22698MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Quadro RTX 6000     Off  | 00000000:B5:00.0 Off |                    0 |
| N/A   25C    P8    14W / 250W |      8MiB / 22698MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Quadro RTX 6000     Off  | 00000000:B6:00.0 Off |                    0 |
| N/A   24C    P8    13W / 250W |      8MiB / 22698MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1755      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      9952      C   python                            225MiB |
|    0   N/A  N/A     10064      C   python                            225MiB |
|    0   N/A  N/A     10150      C   python                            225MiB |
|    0   N/A  N/A     10238      C   python                            225MiB |
|    1   N/A  N/A      1755      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      1755      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      1755      G   /usr/lib/xorg/Xorg                  4MiB |
|    4   N/A  N/A      1755      G   /usr/lib/xorg/Xorg                  4MiB |
|    5   N/A  N/A      1755      G   /usr/lib/xorg/Xorg                  4MiB |
|    6   N/A  N/A      1755      G   /usr/lib/xorg/Xorg                  4MiB |
|    7   N/A  N/A      1755      G   /usr/lib/xorg/Xorg                  4MiB |
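
Since the Live error reported here is cuMemAlloc (GPU memory) rather than cuMemHostAlloc (host memory), it may help to log per-GPU memory use while the session runs instead of relying on a single snapshot. A minimal sketch, assuming nvidia-smi is on the PATH; parse_gpu_memory and query_gpu_memory are hypothetical helper names, and free memory is only approximated as total minus used.

```python
import subprocess

# Ask nvidia-smi for machine-readable CSV instead of the boxed table.
QUERY = ["nvidia-smi",
         "--query-gpu=index,name,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Turn 'index, name, used, total' CSV rows (MiB) into a list of dicts."""
    rows = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        index, name, used, total = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "name": name,
            "used_mib": int(used),
            "free_mib": int(total) - int(used),  # approximation, see lead-in
        })
    return rows


def query_gpu_memory():
    """Run nvidia-smi once and return the parsed per-GPU memory usage."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)
```

Calling query_gpu_memory() every few seconds from a loop while the Live session runs would show whether one GPU fills up shortly before the error appears.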

Paul · February 19, 2021, 6:55pm

I was not able to figure out this issue on our systems. We abandoned the database and started over with a new one, and everything works as expected now.

spunjani · February 19, 2021, 8:48pm

Thanks for the update @Paul, and glad you were able to sort this out. We’ll update the post if we are able to uncover any other ideas on the root cause.

Navid · March 22, 2021, 5:08pm

We see a similar issue occurring spontaneously after several hours of running fine with NU-Refinement (New). It is reproducible on multiple jobs with different box sizes and particle numbers on v3.1.0:

Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/refine/newrun.py", line 466, in cryosparc_compute.jobs.refine.newrun.run_homo_refine
  File "cryosparc_worker/cryosparc_compute/jobs/refine/newrun.py", line 467, in cryosparc_compute.jobs.refine.newrun.run_homo_refine
  File "cryosparc_worker/cryosparc_compute/jobs/ctf_refinement/run.py", line 164, in cryosparc_compute.jobs.ctf_refinement.run.full_ctf_refine
  File "cryosparc_worker/cryosparc_compute/jobs/ctf_refinement/run.py", line 434, in cryosparc_compute.jobs.ctf_refinement.run.compute_phase_errors
  File "cryosparc_worker/cryosparc_compute/engine/newengine.py", line 448, in cryosparc_compute.engine.newengine.EngineThread.preprocess_image_data
  File "cryosparc_worker/cryosparc_compute/engine/newengine.py", line 442, in cryosparc_compute.engine.newengine.EngineThread.preprocess_image_data
  File "cryosparc_worker/cryosparc_compute/engine/newgfourier.py", line 22, in cryosparc_compute.engine.newgfourier.get_plan_R2C_2D
  File "/home/hiter/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/fft.py", line 127, in __init__
    onembed, ostride, odist, self.fft_type, self.batch)
  File "/home/hiter/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/cufft.py", line 742, in cufftMakePlanMany
    cufftCheckStatus(status)
  File "/home/hiter/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/cufft.py", line 117, in cufftCheckStatus
    raise e
skcuda.cufft.cufftAllocFailed

CleoShen · March 23, 2021, 7:10pm

Hi Navid,

I get the same error as you when running a 2D classification job. How did you resolve it in the end?

Navid · March 24, 2021, 5:05pm

We have not been able to solve this issue yet.

stephan · March 24, 2021, 5:17pm

Hi @Navid, @CleoShen,

What OS are you running cryoSPARC on?