3D classification (beta) crashes without error message

Hi,

I am trying to use the 3D classification (beta) job, with a set of particles from NU refinement as input. The job starts and runs for a while, and the output looks normal, but then it suddenly crashes without an error message:

[CPU: 3.35 GB]   Finished iteration 67 in 10.655s. Total time so far 952.418s
[CPU: 3.35 GB]   -----------------------------  Iteration 68 
[CPU: 3.35 GB]    Class Similarity: 0.000
[CPU: 3.35 GB]   Spooled a batch of 10000 particles.
[CPU: 3.35 GB]     Learning rate 0.001
[CPU: 3.35 GB]   Solving for new structures...
[CPU: 91.1 MB]   ====== Job process terminated abnormally.

I have tried different options for the initialisation mode (both simple and PCA) and the number of classes (10 and 20), as well as running with and without a mask.

My cryoSPARC version is v3.3.1+220118.
Is there any log file where I might find a more detailed error message?
Thanks for your help!

Hi @gha,

Can you post the job log? You can use the following command to write the last 100 lines to a text file:

cryosparcm joblog PXX JYY | tail -100 > log.txt
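
For example, with hypothetical project and job IDs P3 and J42 (substitute your own):

cryosparcm joblog P3 J42 | tail -100 > log.txt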

Thanks!
Valentin

Dear Valentin,

Thanks for your help! Here are the last 100 lines of the job log:

exception in cufft.Plan.__del__: 
exception in force_free_cufft_plan: 'NoneType' object has no attribute 'handle'
global compute_resid_pow with (500, 10, 1, 1) 11878
[... the line above is repeated 19 more times ...]
HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty 
exception in cufft.Plan.__del__: 
exception in force_free_cufft_plan: 'NoneType' object has no attribute 'handle'
exception in force_free_cufft_plan: 
exception in cufft.Plan.__del__: 
exception in cufft.Plan.__del__: 
[... the six-line block above is repeated nine more times ...]
HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty 
exception in cufft.Plan.__del__: 
exception in force_free_cufft_plan: 'NoneType' object has no attribute 'handle'
global compute_resid_pow with (500, 10, 1, 1) 11878
[... the line above is repeated 10 more times ...]
global compute_resid_pow with========= sending heartbeat
malloc_consolidate(): invalid chunk size
========= main process now complete.
========= monitor process now complete.

Cheers,
Gregor

Strange. We haven’t seen this error before. Could you also provide the output of the following command (run on the worker node), please?

uname -a && free && nvidia-smi

Valentin

uname -a && free && nvidia-smi
Linux echo 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
              total        used        free      shared  buff/cache   available
Mem:      528216880     4249644    42016092       38320   481951144   520512824
Swap:       8388604      144384     8244220
Mon Jan 31 11:34:53 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   25C    P8    17W / 350W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:25:00.0 Off |                  N/A |
| 30%   26C    P8    14W / 350W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:C1:00.0 Off |                  N/A |
| 30%   24C    P8    13W / 350W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:E1:00.0 Off |                  N/A |
| 30%   24C    P8    18W / 350W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3056      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      3056      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      3056      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      3056      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

@gha,

Thanks! It looks like you’re using CUDA 11.4; cryoSPARC officially supports only CUDA <= 11.2. Could you downgrade and see whether that fixes the issue?

Thanks for your help, Valentin!
I actually used this job successfully a couple of times before I patched cryoSPARC two weeks ago. Did the patch change anything related to this job?

I think I had to use the newest CUDA version because of the RTX 3090 cards. If I remember correctly, they did not work with older CUDA versions.

Cheers,
Gregor

@gha Has your problem been resolved?
If not, you may test whether configuring cryoSPARC to use CUDA 11.2 resolves it. The CUDA version used by cryoSPARC and the version displayed by nvidia-smi need not be identical.
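
To confirm which toolkit the worker is currently configured to use, you can inspect the worker configuration file (the path below is an example; adjust it to your installation):

grep CRYOSPARC_CUDA_PATH /path/to/cryosparc_worker/config.sh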
To minimize side effects, such as interference with the existing NVIDIA driver, you may

  • use the “runfile” distribution, as opposed to the Ubuntu package
  • install as a non-root user, at a path readable by the cryoSPARC worker(s)
  • limit the CUDA installation to the “toolkit” component.

A sketch is shown below, but it needs to be adjusted for your system (correct file name, installation path, etc.).
Following successful installation of the toolkit, please update your worker configuration(s) with
cryosparcw newcuda <path>
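
A minimal sketch of such a toolkit-only installation, assuming the CUDA 11.2.2 runfile and a home-directory install path (the file name, version, and paths below are examples and must be adjusted):

# download the CUDA 11.2.2 runfile (pick the release you need)
wget https://developer.download.nvidia.com/compute/cuda/11.2.2/local_installers/cuda_11.2.2_460.32.03_linux.run
# install only the toolkit component, as a non-root user, to a path the worker can read
sh cuda_11.2.2_460.32.03_linux.run --silent --toolkit --toolkitpath=$HOME/cuda-11.2.2
# point the cryoSPARC worker at the new toolkit
/path/to/cryosparc_worker/bin/cryosparcw newcuda $HOME/cuda-11.2.2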