Error during local motion correction job

Hi @team,

I have encountered an unusual error during local motion correction which, interestingly, only appears after the job has been running for about 4 hours.

On this particular machine, only CUDA 11.5 is installed at the moment, and I realized that even the latest cryoSPARC does not yet support CUDA versions newer than 11.2. While this may indeed be the cause of my error, and the fix might be to install CUDA 11.2 and reconfigure cryoSPARC to use it, I decided to ask here first: do you think it could be something else? Otherwise, I need to contact the admin of this machine.
(By the way, most GPU jobs run without problems under CUDA 11.5.)

Please have a look at the following details.

Current cryoSPARC version:

v3.3.1+220215

Error appearing on the overview window of the job:

[CPU: 353.8 MB] Traceback (most recent call last):
File "cryosparc_worker/cryosparc_compute/run.py", line 85, in cryosparc_compute.run.main
File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_local.py", line 378, in cryosparc_compute.jobs.motioncorrection.run_local.run_local_motion_correction_multi
File “/home/angr5008/Software/cryosparc/cryosparc_worker/cryosparc_compute/dataset.py”, line 588, in subset_mask
assert len(mask) == len(self)
AssertionError
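For reference, the failing assertion in `subset_mask` compares a boolean mask's length against the dataset's length. Here is a minimal, simplified sketch of that invariant (illustrative only, not cryoSPARC's actual implementation):

```python
# Simplified stand-in for a dataset supporting boolean-mask subsetting.
# The invariant mirrored here is the one in cryoSPARC's dataset.py:
# the mask must have exactly one entry per row of the dataset.
class Dataset:
    def __init__(self, items):
        self.items = items

    def __len__(self):
        return len(self.items)

    def subset_mask(self, mask):
        assert len(mask) == len(self)  # mismatch raises AssertionError
        return Dataset([x for x, keep in zip(self.items, mask) if keep])

micrographs = Dataset(["mic_001", "mic_002", "mic_003"])
ok = micrographs.subset_mask([True, False, True])  # fine: 3 == 3

try:
    micrographs.subset_mask([True, False])  # 2 != 3 -> AssertionError
except AssertionError:
    print("AssertionError: mask length does not match dataset length")
```

So the error suggests that, somewhere upstream, a mask was built against a different number of items than the dataset being subset.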

The end of job log file:

========= sending heartbeat
========= sending heartbeat
HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty
HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty
**** handle exception rc
set status to failed
========= main process now complete.
========= monitor process now complete.

OS version:

Linux R2D2 5.13.0-30-generic #33~20.04.1-Ubuntu SMP Mon Feb 7 14:25:10 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Memory at the time:

                  total        used        free      shared  buff/cache   available
    Mem:      263786216     6802516     3504520       21948   253479180   254689732
    Swap:     134217724     2825984   131391740

nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

nvidia-smi:

Tue Feb 22 09:48:48 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce …    On   | 00000000:03:00.0  On |                  N/A |
| 27%   34C    P8    19W / 250W |    411MiB /  7979MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce …    On   | 00000000:21:00.0 Off |                  N/A |
| 30%   41C    P8    24W / 350W |      6MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce …    On   | 00000000:4B:00.0 Off |                  N/A |
| 27%   26C    P8     8W / 250W |      6MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2140      G   /usr/lib/xorg/Xorg                188MiB |
|    0   N/A  N/A      2423      G   /usr/bin/gnome-shell               34MiB |
|    0   N/A  N/A      2982      G   …AAAAAAAAA= --shared-files         16MiB |
|    0   N/A  N/A      6493      G   /usr/lib/firefox/firefox          167MiB |
|    1   N/A  N/A      2140      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2140      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

Thanks in advance for your opinion!
André

@AndreGraca We suspect you have encountered a bug that’s triggered by the “Only process this many movies” parameter, and have added it to our TODO list. Thank you for bringing this problem to our attention.


Thanks @wtempel

I am glad I can be of some help :smiley:

In fact, I was going to post an update here a few hours ago:

  • I installed CUDA 11.2 and recompiled cryoSPARC with it, then ran the job, which failed with the same error, so the problem does not seem related to the CUDA version :wink:

Somehow you figured out that it was related to the “Only process this many movies” parameter. Could you explain this error a little, and how you came to suspect this parameter as the origin of the problem?

Also, is it then possible to work around the problem by enabling the “Only process this many movies” parameter and setting it to the full number of movies we would like to process? :thinking:

@AndreGraca Your posting of the AssertionError details, together with a co-worker’s insight, made short work of identifying the problem.
As for a temporary workaround, Only process this many movies should be left blank, while the input should be pared down to the desired set in advance, perhaps with a job of type Exposure Sets Tool or Manually Curate Exposures.


@wtempel How is that workaround supposed to help me?
I have always left “Only process this many movies” blank. Given that, I also do not see how this parameter could be related to the error if you say it is what triggers it.

Am I missing something, or did you misunderstand something?

@AndreGraca Unfortunately, this workaround would not help if the AssertionError did not arise from the Only process this many movies parameter.
An alternative cause for the error could be the input of movies without any associated particles. You may test for (and correct) this condition by connecting particles (in addition to rigid/patch-corrected micrographs) to a Manually Curate Exposures job that precedes Local Motion Correction.
Does this condition apply in your case?
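As an illustration of the condition, a quick sketch of finding micrographs that have no associated particles (the field names here are illustrative, not cryoSPARC’s actual dataset schema):

```python
# Each micrograph has a unique ID; each particle records the ID of the
# micrograph it was picked from. Micrographs whose ID never appears among
# the particles have no associated particles.
micrograph_uids = [101, 102, 103, 104]
particle_mic_uids = [101, 101, 103]  # one entry per particle

mics_with_particles = set(particle_mic_uids)
empty_mics = [uid for uid in micrograph_uids if uid not in mics_with_particles]
print(empty_mics)  # → [102, 104]: candidates to exclude before Local Motion Correction
```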

@wtempel That was it!

After I excluded all the micrographs that had no particles associated with them, the job ran with no problems, as did all the following jobs through to the end of the planned workflow. Hmm… is this a temporary hiccup specific to local motion correction jobs, or should I be vigilant for this situation in other job types as well?

Thanks for the help; you can mark this as ‘solved’ if you like.