So I performed forcedeps and I am getting this during particle extraction with v4.1.1:
Error occurred while processing micrograph J712/motioncorrected/002609194105202031183_FoilHole_12530805_Data_12522960_12522962_20221212_180445_Fractions_patch_aligned_doseweighted.mrc
Traceback (most recent call last):
File "/opt/cryosparc/cryosparc_worker/cryosparc_compute/jobs/pipeline.py", line 60, in exec
return self.process(item)
File "/opt/cryosparc/cryosparc_worker/cryosparc_compute/jobs/extract/run.py", line 498, in process
result = extraction_gpu.do_extract_particles_single_mic_gpu(mic=mic, bg_bin=bg_bin,
File "/opt/cryosparc/cryosparc_worker/cryosparc_compute/jobs/extract/extraction_gpu.py", line 143, in do_extract_particles_single_mic_gpu
ifft_plan = skcuda_fft.Plan(shape = (bin_size, bin_size),
File "/opt/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/fft.py", line 132, in __init__
self.worksize = cufft.cufftMakePlanMany(
File "/opt/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 749, in cufftMakePlanMany
cufftCheckStatus(status)
File "/opt/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 124, in cufftCheckStatus
raise e
cryosparc_compute.skcuda_internal.cufft.cufftAllocFailed
Marking J712/motioncorrected/002609194105202031183_FoilHole_12530805_Data_12522960_12522962_20221212_180445_Fractions_patch_aligned_doseweighted.mrc as incomplete and continuing...
One more update: the error for some images not being processed persists even after executing forcedeps on the worker. All images are processed if I revert to v4.1.0.
I am getting a similar error trying to extract particles with v4.1.1
Many micrographs fail with the following error:
Error occurred while processing micrograph S5/motioncorrected/FoilHole_13208278_Data_13091718_13091720_20200814_153325_fractions_patch_aligned_doseweighted.mrc
Traceback (most recent call last):
File "/scratch/cryosoft/cryosparc/cryosparc_worker/cryosparc_compute/jobs/pipeline.py", line 60, in exec
return self.process(item)
File "/scratch/cryosoft/cryosparc/cryosparc_worker/cryosparc_compute/jobs/extract/run.py", line 498, in process
result = extraction_gpu.do_extract_particles_single_mic_gpu(mic=mic, bg_bin=bg_bin,
File "/scratch/cryosoft/cryosparc/cryosparc_worker/cryosparc_compute/jobs/extract/extraction_gpu.py", line 136, in do_extract_particles_single_mic_gpu
fft_plan = skcuda_fft.Plan(shape=(patch_size, patch_size),
File "/scratch/cryosoft/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/fft.py", line 132, in __init__
self.worksize = cufft.cufftMakePlanMany(
File "/scratch/cryosoft/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 749, in cufftMakePlanMany
cufftCheckStatus(status)
File "/scratch/cryosoft/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 124, in cufftCheckStatus
raise e
cryosparc_compute.skcuda_internal.cufft.cufftInternalError
Marking S5/motioncorrected/FoilHole_13208278_Data_13091718_13091720_20200814_153325_fractions_patch_aligned_doseweighted.mrc as incomplete and continuing...
I am getting the same error during particle extraction (with a slightly different error message):
Error occurred while processing micrograph J1/imported/015139662710773897517_FoilHole_7700328_Data_7666634_7666636_20221208_055455_EER.mrc
Traceback (most recent call last):
File "/home/changliu/Applications/cryosparc/cryosparc_worker/cryosparc_compute/jobs/pipeline.py", line 60, in exec
return self.process(item)
File "/home/changliu/Applications/cryosparc/cryosparc_worker/cryosparc_compute/jobs/extract/run.py", line 498, in process
result = extraction_gpu.do_extract_particles_single_mic_gpu(mic=mic, bg_bin=bg_bin,
File "/home/changliu/Applications/cryosparc/cryosparc_worker/cryosparc_compute/jobs/extract/extraction_gpu.py", line 143, in do_extract_particles_single_mic_gpu
ifft_plan = skcuda_fft.Plan(shape = (bin_size, bin_size),
File "/home/changliu/Applications/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/fft.py", line 132, in __init__
self.worksize = cufft.cufftMakePlanMany(
File "/home/changliu/Applications/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 749, in cufftMakePlanMany
cufftCheckStatus(status)
File "/home/changliu/Applications/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 124, in cufftCheckStatus
raise e
cryosparc_compute.skcuda_internal.cufft.cufftInternalError
Marking J1/imported/015139662710773897517_FoilHole_7700328_Data_7666634_7666636_20221208_055455_EER.mrc as incomplete and continuing...
As I wrote in another post, this error seems to be related to the GPU running out of memory: I noticed that GPU memory usage was about 5 GB when the job was initially launched and gradually increased as the job ran. Eventually, once one or more of the GPUs ran out of memory, I started to see the error messages. Using the CPU version of the job, particle extraction completed successfully for all micrographs, and system memory usage stayed about the same throughout the job. It seems the GPU version of the particle extraction job is unable to release GPU memory after it finishes extraction from previous batches of micrographs.
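For what it's worth, the tracebacks point at `skcuda_fft.Plan` creation inside the per-micrograph processing path. If a new cuFFT plan (and its GPU workspace) were allocated for every micrograph and never destroyed, memory would grow exactly as described. Below is a hypothetical pure-Python sketch of the difference between the two allocation patterns; the `Plan` class and helper functions are stand-ins for illustration, not CryoSPARC's actual classes.

```python
# Hypothetical illustration of the suspected leak: allocating a new FFT
# "plan" per micrograph vs. caching one plan per transform shape. The Plan
# class is a stand-in that mimics a cuFFT plan holding a workspace buffer.

class Plan:
    def __init__(self, shape):
        self.shape = shape
        # Mimic the GPU scratch workspace a real cuFFT plan reserves.
        self.workspace = bytearray(shape[0] * shape[1] * 8)

    def destroy(self):
        # Analogue of cufftDestroy(): release the workspace explicitly.
        self.workspace = None


def extract_leaky(micrographs, shape):
    """A new plan per micrograph, never destroyed: memory grows linearly."""
    plans = []
    for _ in micrographs:
        plans.append(Plan(shape))  # leaked: nothing ever calls destroy()
    return len(plans)  # number of live plans (and workspaces) at the end


def extract_cached(micrographs, shape):
    """One plan per distinct shape, reused across all micrographs."""
    cache = {}
    for _ in micrographs:
        plan = cache.setdefault(shape, Plan(shape))
        # ... run the FFT for this micrograph with the cached plan ...
    return len(cache)  # only one live plan regardless of micrograph count


print(extract_leaky(range(2500), (512, 512)))   # 2500 live plans
print(extract_cached(range(2500), (512, 512)))  # 1 live plan
```

With a 24 GB card, 2500 leaked workspaces of a few megabytes each would plausibly exhaust VRAM, which matches the symptom of the job failing only after thousands of micrographs.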
Just an update: the installation guide now lists this new requirement:
"Nvidia driver version is 460.32.03 or newer on all GPU machines. Run nvidia-smi to verify"
We have 3 workstations; the one machine that fulfills this requirement and has CUDA 11.7 (plus the other listed requirements) is giving no errors. We are in the process of updating the Nvidia drivers on the other workstations and will test whether that is what we were missing.
Hi @Bassem, thanks for sharing this information. We actually already have relatively new versions of the Nvidia driver (520.61.05) and CUDA (11.8) installed on both of the workstations that gave this error.
Another thing I noticed is that the error generally happens when the extraction job reaches 2000-3000 micrographs. This is also when my GPU runs out of VRAM (24 GB). So depending on the amount of VRAM your GPUs have, you may not have this error, especially if you extract from a relatively small dataset. But I would definitely be interested in knowing if updating the nvidia drivers on the other workstations you have solves this error.
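A rough back-of-envelope check of those numbers (the per-micrograph figure below is an inference from my observations, not a measured value):

```python
# Rough estimate: if VRAM climbs from ~5 GB at launch to the 24 GB limit
# over roughly 2500 micrographs (the midpoint of the 2000-3000 range where
# the job starts failing), each micrograph leaks on the order of:
vram_total_gb = 24
vram_baseline_gb = 5
micrographs_until_failure = 2500

leak_per_mic_mb = (vram_total_gb - vram_baseline_gb) * 1024 / micrographs_until_failure
print(f"~{leak_per_mic_mb:.1f} MB leaked per micrograph")  # ~7.8 MB
```

A few megabytes per micrograph is the right order of magnitude for a 2D FFT workspace on patches a few hundred pixels across, which is consistent with a per-micrograph plan allocation never being freed.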
The update on the other machine did not help with the error during particle extraction.
I tried to keep an eye on the GPU memory. With v4.1.0 the max usage I noticed on any given GPU during that job is ~7 GB/24 GB, while with v4.1.1 it went all the way up to 19-20 GB/24 GB doing the same job. I do not know what that means, I am stretching my nerdy computer self here.
I downgraded CryoSPARC to v4.1.0. Now the particle extraction job can finish without an issue. The GPU memory usage stayed about the same (~7-8 GB, consistent with what you observed) throughout the job.
Can you please provide us with:
- the outputs of nvidia-smi and uname -a
- whether you ran cryosparcw install-3dflex or not
- the instance_information field from the failing job's metadata
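For convenience, the first two items can be collected in one place with a small script (a hypothetical helper, not an official CryoSPARC tool; it only invokes nvidia-smi if it is actually on PATH):

```python
# Gather the `uname -a`-style system string and the nvidia-smi report.
# platform.uname() covers the uname information portably, and nvidia-smi
# is skipped gracefully when the driver utilities are not installed.
import platform
import shutil
import subprocess


def gather_diagnostics():
    info = {"uname": " ".join(platform.uname())}
    smi = shutil.which("nvidia-smi")
    if smi:
        info["nvidia-smi"] = subprocess.run(
            [smi], capture_output=True, text=True
        ).stdout
    else:
        info["nvidia-smi"] = "nvidia-smi not found on PATH"
    return info


for key, value in gather_diagnostics().items():
    print(f"=== {key} ===\n{value}")
```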
3- Linux spgpu 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 30 15:51:32 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
4- instance information → that was deleted by the user when we rolled back to v4.1.0, but I have the entire job report folder. Which part can I share with you?