Apparently, I found that I had CUDA loaded via the module system. Once I ran
module unload cuda
before installing the 3DFlex dependencies, things started to work. I will update here if that changes or I get the same error again.
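A minimal sketch of the sequence that worked for me, assuming a module-managed CUDA and a worker installed under a path like the placeholder below (adjust to your own installation):

# make sure no site-wide CUDA module shadows the CUDA toolkit the worker was configured with
module unload cuda
# then (re)install the 3DFlex dependencies from the worker's bin directory
/path/to/cryosparc_worker/bin/cryosparcw install-3dflex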
bassem
Replicated the error. It affects some micrographs but not all; the older version, v4.1.0, did not produce the same error.
================= CRYOSPARCW ======= 2022-12-20 19:24:57.947508 =========
Project P46 Job J436
Master spgpu Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 41734
MAIN PID 41734
extract.run cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process
***************************************************************
========= sending heartbeat
[... repeated "sending heartbeat" messages omitted ...]
HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty
========= sending heartbeat
HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty
========= sending heartbeat
min: -147877.784576 max: 147933.160736
min: -23476.686974 max: 23480.147987
***************************************************************
========= main process now complete.
========= monitor process now complete.
More updates: the error where some of the images are not processed persists even after executing forcedeps on the worker. All images are processed if I revert to v4.1.0.
The command only works with cryosparcm, not cryosparcw. Is that what you meant?
cryosparcw may not be in your $PATH, so you may have to run
/opt/cryosparc/cryosparc_worker/bin/cryosparcw call /usr/bin/env | grep -v CRYOSPARC_LICENSE_ID
The grep command prevents the display of your (confidential) CryoSPARC license id.
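If the concern is a stray module-loaded CUDA leaking into the worker environment (as earlier in this thread), one way to narrow the same output down to CUDA-related variables might be:

/opt/cryosparc/cryosparc_worker/bin/cryosparcw call /usr/bin/env | grep -i cuda

The grep -i cuda filter is just a suggestion; it also keeps the license id out of the output, since that line does not contain "cuda".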
I am getting a similar error trying to extract particles with v4.1.1.
Many micrographs fail with the following error:
Error occurred while processing micrograph S5/motioncorrected/FoilHole_13208278_Data_13091718_13091720_20200814_153325_fractions_patch_aligned_doseweighted.mrc
Traceback (most recent call last):
File "/scratch/cryosoft/cryosparc/cryosparc_worker/cryosparc_compute/jobs/pipeline.py", line 60, in exec
return self.process(item)
File "/scratch/cryosoft/cryosparc/cryosparc_worker/cryosparc_compute/jobs/extract/run.py", line 498, in process
result = extraction_gpu.do_extract_particles_single_mic_gpu(mic=mic, bg_bin=bg_bin,
File "/scratch/cryosoft/cryosparc/cryosparc_worker/cryosparc_compute/jobs/extract/extraction_gpu.py", line 136, in do_extract_particles_single_mic_gpu
fft_plan = skcuda_fft.Plan(shape=(patch_size, patch_size),
File "/scratch/cryosoft/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/fft.py", line 132, in __init__
self.worksize = cufft.cufftMakePlanMany(
File "/scratch/cryosoft/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 749, in cufftMakePlanMany
cufftCheckStatus(status)
File "/scratch/cryosoft/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 124, in cufftCheckStatus
raise e
cryosparc_compute.skcuda_internal.cufft.cufftInternalError
Marking S5/motioncorrected/FoilHole_13208278_Data_13091718_13091720_20200814_153325_fractions_patch_aligned_doseweighted.mrc as incomplete and continuing…
Hi,
I am getting the same error during particle extraction (with a slightly different error message):
Error occurred while processing micrograph J1/imported/015139662710773897517_FoilHole_7700328_Data_7666634_7666636_20221208_055455_EER.mrc
Traceback (most recent call last):
File "/home/changliu/Applications/cryosparc/cryosparc_worker/cryosparc_compute/jobs/pipeline.py", line 60, in exec
return self.process(item)
File "/home/changliu/Applications/cryosparc/cryosparc_worker/cryosparc_compute/jobs/extract/run.py", line 498, in process
result = extraction_gpu.do_extract_particles_single_mic_gpu(mic=mic, bg_bin=bg_bin,
File "/home/changliu/Applications/cryosparc/cryosparc_worker/cryosparc_compute/jobs/extract/extraction_gpu.py", line 143, in do_extract_particles_single_mic_gpu
ifft_plan = skcuda_fft.Plan(shape = (bin_size, bin_size),
File "/home/changliu/Applications/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/fft.py", line 132, in __init__
self.worksize = cufft.cufftMakePlanMany(
File "/home/changliu/Applications/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 749, in cufftMakePlanMany
cufftCheckStatus(status)
File "/home/changliu/Applications/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 124, in cufftCheckStatus
raise e
cryosparc_compute.skcuda_internal.cufft.cufftInternalError
Marking J1/imported/015139662710773897517_FoilHole_7700328_Data_7666634_7666636_20221208_055455_EER.mrc as incomplete and continuing...
As I wrote in another post, this error seems to be related to the GPU running out of memory: I noticed that the GPU memory in use was about 5 GB when the job was first launched and gradually increased as the job ran. Eventually, as one or more of the GPUs ran out of memory, I started to see the error messages. Using the CPU version of the job, particle extraction completed successfully for all the micrographs without any issue, and the system memory usage stayed roughly constant for the whole job. It seems that the GPU version of the particle extraction job is unable to release GPU memory after it finishes extraction from previous batches of micrographs.
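For anyone who wants to watch this behaviour, one simple way to log per-GPU memory use while the job runs is a looping nvidia-smi query (the 30-second interval is arbitrary):

nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 30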
Thanks.
Hi,
Just an update: the installation guide now lists this new requirement:
“Nvidia driver version is 460.32.03 or newer on all GPU machines. Run nvidia-smi to verify”
We have 3 workstations; the one machine that fulfils this requirement and has CUDA 11.7 (plus the other listed requirements) is giving no error. We are in the process of updating the Nvidia drivers on the other workstations and will test whether that is what we were missing.
Hi @Bassem, thanks for sharing this information. We actually already have relatively new versions of the Nvidia driver (520.61.05) and CUDA (11.8) installed on both of the workstations that gave this error.
Another thing I noticed is that the error generally happens when the extraction job reaches 2000-3000 micrographs. This is also when my GPU runs out of VRAM (24 GB). So depending on the amount of VRAM your GPUs have, you may not have this error, especially if you extract from a relatively small dataset. But I would definitely be interested in knowing if updating the nvidia drivers on the other workstations you have solves this error.
Thanks.
Will update, @YYang. I am also waiting on the CryoSPARC team regarding some diagnostics that I submitted.
The update on the other machine did not help with the error during particle extraction.
I tried to keep an eye on the GPU memory. With v4.1.0, the maximum usage I noticed on any given GPU during that job was ~7 GB/24 GB, while with v4.1.1 it went all the way up to 19-20 GB/24 GB doing the same job. I do not know what that means; I am stretching my nerdy computer self here.
I downgraded CryoSPARC to v4.1.0. Now the particle extraction job can finish without an issue. The GPU memory usage stayed about the same (~7-8 GB, consistent with what you observed) throughout the job.
Hi @YYang, @Bassem @wangyan16 @RD_Cryo,
Can you please provide us with:
- the output of nvidia-smi
- the output of uname -a
- whether you ran cryosparcw install-3dflex or not
- the instance_information field from the failing job’s metadata (example commands sketched below)
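A rough sketch of how one might gather these; the project/job UIDs are just the ones from the log earlier in this thread, and the cryosparcm cli call is an assumption that may vary between CryoSPARC versions:

nvidia-smi
uname -a
# instance_information from the failing job's metadata (hypothetical UIDs P46/J436)
cryosparcm cli "get_job('P46', 'J436', 'instance_information')"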
1- The error happens without installing the 3DFlex dependencies, just by updating to v4.1.1.
2- output of nvidia-smi
Thu Dec 29 21:16:30 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.57 Driver Version: 515.57 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:1B:00.0 Off | N/A |
| 30% 28C P0 104W / 350W | 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:1C:00.0 Off | N/A |
| 30% 29C P0 115W / 350W | 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce ... Off | 00000000:1D:00.0 Off | N/A |
| 30% 30C P0 108W / 350W | 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce ... Off | 00000000:1E:00.0 Off | N/A |
| 30% 29C P0 103W / 350W | 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA GeForce ... Off | 00000000:B2:00.0 Off | N/A |
| 30% 28C P0 105W / 350W | 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA GeForce ... Off | 00000000:B3:00.0 Off | N/A |
| 30% 28C P0 106W / 350W | 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA GeForce ... Off | 00000000:B4:00.0 Off | N/A |
| 30% 28C P0 101W / 350W | 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA GeForce ... Off | 00000000:B5:00.0 Off | N/A |
| 30% 28C P0 106W / 350W | 0MiB / 24576MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
3- Linux spgpu 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 30 15:51:32 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
4- instance_information → that was deleted by the user when we rolled back to v4.1.0, but I have the entire job report folder. Which part can I share with you?
Hi @stephan,
nvidia-smi
Sun Jan 1 04:02:57 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06 Driver Version: 520.56.06 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A5500 Off | 00000000:01:00.0 On | Off |
| 80% 55C P2 124W / 230W | 10338MiB / 24564MiB | 12% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A5500 Off | 00000000:2B:00.0 Off | Off |
| 30% 43C P8 21W / 230W | 8MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A5500 Off | 00000000:41:00.0 Off | Off |
| 30% 44C P8 19W / 230W | 8MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A5500 Off | 00000000:61:00.0 Off | Off |
| 80% 36C P8 22W / 230W | 8MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2912 G /usr/lib/xorg/Xorg 352MiB |
| 0 N/A N/A 3044 G /usr/bin/gnome-shell 84MiB |
| 0 N/A N/A 4804 G ...903988018181927716,131072 131MiB |
| 0 N/A N/A 729147 C python 9646MiB |
| 1 N/A N/A 2912 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 2912 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 2912 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
uname -a
Linux cryows1 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
I have not installed the 3DFlex dependencies yet.
instance_information
"instance_information": {
"platform_node": "cryows1",
"platform_release": "5.15.0-56-generic",
"platform_version": "#62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022",
"platform_architecture": "x86_64",
"physical_cores": 32,
"max_cpu_freq": 3600.0,
"total_memory": "251.53GB",
"available_memory": "230.69GB",
"used_memory": "18.55GB"
}
Hi @YYang, @Bassem @wangyan16 @RD_Cryo,
Thank you all for your help with this.
We’ve released a patch for v4.1.1 that fixes this issue. Please see:
It is running smoothly after the patch. Thanks to you and the CryoSPARC team for looking into this.
cheers,
bassem