Reference-based motion correction crashing

kpsleung · January 12, 2024, 5:15pm

Hi all,

I extracted particles from 2 datasets in RELION and imported them into CryoSPARC for further analysis. When I got to reference-based motion correction (RBMC), the job always crashes on one of the datasets with the error message "====== Job process terminated abnormally."

I re-ran the job several times and checked job.log after each crash. One time the error message was an assertion error: nmov > movie_no. This time it showed gain_ref_blob/flip_x: invalid handle 1125899906842637, no heap at index 13 (errno 2: No such file or directory).

I have checked that none of the soft links to the movies are broken. I also tried removing the movie that may be causing the crash from processing, but to no avail. Any suggestions for troubleshooting?
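For anyone wanting to repeat the link check across a whole import directory, a small script can do it in one pass. This is only a sketch under assumptions: the function name `find_unreadable_movies`, the `*.eer` pattern, and the idea that all movie links sit in a single directory are illustrative and not taken from CryoSPARC.

```python
import fnmatch
import os


def find_unreadable_movies(movie_dir, pattern="*.eer"):
    """Return (file name, reason) pairs for movie entries that look unusable.

    Hypothetical helper: checks each matching entry for a broken symlink,
    missing read permission, or zero size.
    """
    problems = []
    for name in sorted(os.listdir(movie_dir)):
        if not fnmatch.fnmatch(name, pattern):
            continue
        path = os.path.join(movie_dir, name)
        if os.path.islink(path) and not os.path.exists(path):
            # os.path.exists follows the link, so False means a dangling target
            problems.append((name, "broken symlink"))
        elif not os.access(path, os.R_OK):
            problems.append((name, "not readable"))
        elif os.path.getsize(path) == 0:
            problems.append((name, "empty file"))
    return problems
```

Run it as the same user that runs the CryoSPARC worker, since a link that resolves for one account may not be readable for another. An empty result only rules out these three problems; it says nothing about whether the movie contents are intact.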

wtempel · January 12, 2024, 5:46pm

@kpsleung Please can you post the version and patch level of your CryoSPARC installation, and the complete error trace.

kpsleung · January 12, 2024, 6:06pm

The version is v4.4.1. The following is the end of the job.log:

ElectronCountedFramesDecompressor::prepareRead: found 1085 frames in EER-TIFF file.

refmotion worker 0 (NVIDIA RTX A4000)
BFGS iterations:      300
scale (alpha):        4.861825
noise model (sigma2): 39.205906
     TIME (s)  SECTION
  0.000145032  sanity
  9.167947948  read movie
  0.045142473  get gain, defects
  0.060268143  read bg
  0.002517539  read rigid
  0.893805941  prep_movie
  0.407442085  extract from frames
  0.000571716  extract from refs
  0.000000474  adj
  0.000000150  bfactor
  0.007225606  rigid motion correct
  0.000213075  get noise, scale
  0.738295133  optimize trajectory
  0.155823695  shift_sum patches
  0.002365920  ifft
  0.001661418  unpad
  0.000250805  fill out dataset
  0.011622373  write output files
 11.495299527  --- TOTAL ---

ElectronCountedFramesDecompressor: reading using TIFF-EER mode.
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered     (repeated a lot of lines)
ElectronCountedFramesDecompressor::prepareRead: found 1085 frames in EER-TIFF file.
gain_ref_blob/flip_x: invalid handle 1125899906842637, no heap at index 13 (errno 2: No such file or directory)
========= main process now complete at 2024-01-12 18:16:10.201676.
========= monitor process now complete at 2024-01-12 18:16:20.703899.

kpsleung · January 13, 2024, 3:12pm

@wtempel here is another set of error messages from job.log in the recent crash:

ElectronCountedFramesDecompressor: reading using TIFF-EER mode.
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered       [repeat a lot of times]
HOST ALLOCATION FUNCTION: using numba.cuda.pinned_array
**** handle exception rc
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered
/home/wcyl/cryosparc_worker/cryosparc_compute/jobs/motioncorrection/mic_utils.py:95: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @jit(nogil=True)
/home/wcyl/cryosparc_worker/cryosparc_compute/micrographs.py:563: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def contrast_normalization(arr_bin, tile_size = 128):
/home/wcyl/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py:2919: UserWarning: NVRTC log messages whilst compiling kernel:

kernel(963): warning #177-D: variable "Nb2p1" was declared but never referenced


  warnings.warn(msg)
/home/wcyl/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py:108: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f0dd8252310> (size 2). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead.
  self._target(*self._args, **self._kwargs)

[this message repeats many times; the address (starting with 0x7...) varies and the size varies from 1 to 3]
 
/home/wcyl/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numpy/ctypeslib.py:518: RuntimeWarning: A builtin ctypes object gave a PEP3118 format string that does not match its itemsize, so a best-guess will be made of the data type. Newer versions of python may behave correctly.
  return asarray(obj)
Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 95, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_reference_motion.py", line 495, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_reference_motion.run_reference_motion_correction
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/refmotion.py", line 1220, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.refmotion.mainfn_reconstruct
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/refmotion.py", line 1235, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.refmotion.mainfn_reconstruct
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/refmotion.py", line 610, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.refmotion.slice_vol
  File "/home/wcyl/cryosparc_worker/cryosparc_compute/particles.py", line 70, in init
    self.N_input = int(self[self.blob_key + '/shape'][0,0])
IndexError: index 0 is out of bounds for axis 0 with size 0
set status to failed
========= main process now complete at 2024-01-13 22:31:46.834807.
========= monitor process now complete at 2024-01-13 22:31:46.840172.

hsnyder · February 13, 2024, 9:17pm

Hi @kpsleung. Are you still experiencing these issues? And just to clarify, you’re getting different error messages each re-run, even if you just re-run the job without changing any parameters or inputs?

kpsleung · February 15, 2024, 3:54pm

Hi @hsnyder. Yes, running the same job with the same inputs and parameters gave different error messages. For now, it seems that the problem can be worked around by redoing the extraction and removing any particles that lie on the micrograph edge.
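The edge filter described above can be sketched with NumPy. This is a sketch under stated assumptions, not the exact procedure used in this thread: the helper name `inside_micrograph` is made up, the box size is taken to be in micrograph pixels, and the field names in the usage comment (`location/center_x_frac`, `location/center_y_frac`, `location/micrograph_shape`) are assumed to be present in the particle metadata.

```python
import numpy as np


def inside_micrograph(center_x_frac, center_y_frac, mic_shape, box_size):
    """Return a boolean mask that is True where the extraction box fits fully
    inside the micrograph.

    center_x_frac, center_y_frac: particle centers as fractions of width/height.
    mic_shape: (N, 2) array of (height, width) in pixels, one row per particle.
    box_size: extraction box edge length in micrograph pixels (assumption).
    """
    mic_shape = np.asarray(mic_shape, dtype=float)
    height, width = mic_shape[:, 0], mic_shape[:, 1]
    x = np.asarray(center_x_frac, dtype=float) * width
    y = np.asarray(center_y_frac, dtype=float) * height
    half = box_size / 2.0
    return (x >= half) & (x <= width - half) & (y >= half) & (y <= height - half)


# Hypothetical usage on a particle metadata table already loaded as a
# structured array named `particles`:
# keep = inside_micrograph(particles["location/center_x_frac"],
#                          particles["location/center_y_frac"],
#                          particles["location/micrograph_shape"],
#                          box_size=400)
# particles_kept = particles[keep]
```

If the movies are super-resolution or the particles were Fourier-cropped, the box size and the micrograph shape may be in different pixel units, so the threshold would need to be adjusted accordingly.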

sjcalise · July 23, 2024, 6:35pm

Hi folks, I have gotten a similar error with RBMC a couple times now. Here’s all the job.log says below. Any thoughts?

**** handle exception rc
set status to failed
Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 115, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_reference_motion.py", line 469, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_reference_motion.run_reference_motion_correction
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/refmotion.py", line 1259, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.refmotion.mainfn_reconstruct
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/refmotion.py", line 1275, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.refmotion.mainfn_reconstruct
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/refmotion.py", line 637, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.refmotion.slice_vol
  File "/opt/applications/cryosparc2_worker/cryosparc_compute/particles.py", line 70, in init
    self.N_input = int(self[self.blob_key + '/shape'][0,0])
IndexError: index 0 is out of bounds for axis 0 with size 0
srun: error: emnoded38: task 0: Exited with exit code 1

wtempel · July 23, 2024, 7:27pm

Do you get this specific error

sjcalise:

  File "/opt/applications/cryosparc2_worker/cryosparc_compute/particles.py", line 70, in init
    self.N_input = int(self[self.blob_key + '/shape'][0,0])
IndexError: index 0 is out of bounds for axis 0 with size 0

at the same point of job progress every time you clone the failed job and run the failed job’s clone?

sjcalise · July 23, 2024, 9:20pm

Not sure, I’ll test and report back

sjcalise · July 25, 2024, 5:44pm

Hi @wtempel, I cloned and re-ran the job, and it failed at a different point with the same error. The first run failed at 1% progress into computing empirical dose weights; the cloned second run failed at 92% progress into motion correcting particles (oof, it almost finished).

Here are the settings I used for the job, in case it’s helpful:
Save results in 16-bit floating point ON
Skip movies with wrong frame count ON
Fourier crop to box size 400 ON (super-res movies)
Parallelized over 4 GPUs

All other settings were default

sjcalise · July 31, 2024, 5:47pm

I cloned the second run and tried again with slicing_gpu_is_worker = OFF just to test, and it ended up getting the same error at the same point of job progress as the second run (92% of motion correction).

So now I have:
First run (J163) - failed at 1% into computing empirical dose weights
Second run (J165) - failed at 92% into motion correcting particles
Third run (J170) - failed at 92% into motion correcting particles

I’ve uploaded the log files for all 3 jobs to GitHub (sjcalise/RBMC: Troubleshooting reference-based motion correction in cryosparc) in case it helps to take a look at them.

hsnyder · August 26, 2024, 5:04pm

Hi @sjcalise, sorry for the delay in responding to this and thanks for providing logs. While I can’t tell with total certainty, my guess would be that you have a specific particle stack that either has corrupt metadata, or was taken from too early in the processing pipeline (i.e. before extraction). Did you connect several distinct particle sets to the RBMC job?

The empirical dose weighting step selects a random subset of movies on which to operate, which is why it could produce the same failure at a different point of progress. In the motion correction step, the failure would normally always occur in the same place.

–Harris
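A rough way to screen a particle set for the kind of metadata problem suggested above is to inspect the fields named in the traceback. The IndexError reports an array of size 0 along axis 0 when reading the `blob/shape` field, so an empty table or a missing or zeroed `blob/shape` are the first things to look for. The sketch below is hypothetical: `summarize_particle_metadata` is not a CryoSPARC function, and it assumes the particle metadata is available as a NumPy structured array with `blob/path` and `blob/shape` fields.

```python
import numpy as np


def summarize_particle_metadata(particles, blob_key="blob"):
    """Return a list of human-readable issues found in a particle table.

    particles: NumPy structured array (assumed layout, one row per particle).
    blob_key: field prefix to check, e.g. "blob" for "blob/shape".
    """
    issues = []
    if len(particles) == 0:
        # An empty table would make indexing [0, 0] on any field fail
        return ["dataset has zero rows"]

    names = particles.dtype.names or ()
    shape_field = blob_key + "/shape"
    path_field = blob_key + "/path"

    for field in (shape_field, path_field):
        if field not in names:
            issues.append(f"missing field {field}")

    if shape_field in names:
        shapes = np.asarray(particles[shape_field]).reshape(len(particles), -1)
        n_bad = int(np.sum(np.any(shapes <= 0, axis=1)))
        if n_bad:
            issues.append(f"{n_bad} rows with non-positive {shape_field}")

    if path_field in names:
        n_empty = sum(1 for p in particles[path_field] if len(p) == 0)
        if n_empty:
            issues.append(f"{n_empty} rows with empty {path_field}")

    return issues
```

How the table gets loaded is left out on purpose. Depending on the CryoSPARC version, a `.cs` file may or may not be readable with `numpy.load`, so the loading step is something to confirm for your installation. A clean report here does not rule out other kinds of corruption.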

sjcalise · September 4, 2024, 10:11pm

Hi Harris, no worries about the delay. I never figured this out, but it’s possible there was corrupt metadata in a specific particle stack I was using. I didn’t use any particles from before extraction, and only connected one particle set to the RBMC job. In the last few weeks I have done some additional classification with this dataset, reduced my particle count to ~290k particles, and successfully run an RBMC job with these particles.