I have extracted particles from 2 datasets in RELION and loaded them into cryoSPARC for further analysis. When I got to the point of using RBMC, it always crashes on one of the datasets with the error message "
====== Job process terminated abnormally."
After re-running several times and checking job.log, the error message differs between runs: one time it was an assertion error, nmov > movie_no; this time it showed gain_ref_blob/flip_x: invalid handle 1125899906842637, no heap at index 13 (errno 2: No such file or directory).
I have checked that none of the soft links to the movies are broken. I also tried removing the movie that might be causing the crash from processing, but to no avail. Any suggestions for troubleshooting?
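For reference, the link check was along these lines (a minimal sketch, not the exact commands; "/data/movies" is a placeholder for the actual import directory):

# Sketch: list any broken symlinks under the movie directory.
from pathlib import Path
import os

movie_dir = Path("/data/movies")  # placeholder: directory the import job points at
broken = [p for p in movie_dir.rglob("*") if p.is_symlink() and not p.exists()]
print(f"{len(broken)} broken symlink(s)")
for p in broken:
    print(p, "->", os.readlink(p))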
The version is v4.4.1. The following is the end of the job.log:
ElectronCountedFramesDecompressor::prepareRead: found 1085 frames in EER-TIFF file.
refmotion worker 0 (NVIDIA RTX A4000)
BFGS iterations: 300
scale (alpha): 4.861825
noise model (sigma2): 39.205906
TIME (s) SECTION
0.000145032 sanity
9.167947948 read movie
0.045142473 get gain, defects
0.060268143 read bg
0.002517539 read rigid
0.893805941 prep_movie
0.407442085 extract from frames
0.000571716 extract from refs
0.000000474 adj
0.000000150 bfactor
0.007225606 rigid motion correct
0.000213075 get noise, scale
0.738295133 optimize trajectory
0.155823695 shift_sum patches
0.002365920 ifft
0.001661418 unpad
0.000250805 fill out dataset
0.011622373 write output files
11.495299527 --- TOTAL ---
ElectronCountedFramesDecompressor: reading using TIFF-EER mode.
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered [repeated many times]
ElectronCountedFramesDecompressor::prepareRead: found 1085 frames in EER-TIFF file.
gain_ref_blob/flip_x: invalid handle 1125899906842637, no heap at index 13 (errno 2: No such file or directory)
========= main process now complete at 2024-01-12 18:16:10.201676.
========= monitor process now complete at 2024-01-12 18:16:20.703899.
@wtempel here is another set of error messages from job.log from the most recent crash:
ElectronCountedFramesDecompressor: reading using TIFF-EER mode.
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered [repeated many times]
HOST ALLOCATION FUNCTION: using numba.cuda.pinned_array
**** handle exception rc
TIFFReadDirectory: Unknown field with tag 65002 (0xfdea) encountered [repeated several more times]
/home/wcyl/cryosparc_worker/cryosparc_compute/jobs/motioncorrection/mic_utils.py:95: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
@jit(nogil=True)
/home/wcyl/cryosparc_worker/cryosparc_compute/micrographs.py:563: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
def contrast_normalization(arr_bin, tile_size = 128):
/home/wcyl/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py:2919: UserWarning: NVRTC log messages whilst compiling kernel:
kernel(963): warning #177-D: variable "Nb2p1" was declared but never referenced
warnings.warn(msg)
/home/wcyl/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py:108: UserWarning: Kernel function slice_volume called with very small array <cryosparc_compute.gpu.gpuarray.GPUArray object at 0x7f0dd8252310> (size 2). Array will be passed to kernel as a pointer. Consider modifying the kernel to accept individual scalar arguments instead.
self._target(*self._args, **self._kwargs)
[this message repeats many times; the 0x7... addresses vary and the reported size varies from 1 to 3]
/home/wcyl/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numpy/ctypeslib.py:518: RuntimeWarning: A builtin ctypes object gave a PEP3118 format string that does not match its itemsize, so a best-guess will be made of the data type. Newer versions of python may behave correctly.
return asarray(obj)
Traceback (most recent call last):
File "cryosparc_master/cryosparc_compute/run.py", line 95, in cryosparc_master.cryosparc_compute.run.main
File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_reference_motion.py", line 495, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_reference_motion.run_reference_motion_correction
File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/refmotion.py", line 1220, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.refmotion.mainfn_reconstruct
File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/refmotion.py", line 1235, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.refmotion.mainfn_reconstruct
File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/refmotion.py", line 610, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.refmotion.slice_vol
File "/home/wcyl/cryosparc_worker/cryosparc_compute/particles.py", line 70, in init
self.N_input = int(self[self.blob_key + '/shape'][0,0])
IndexError: index 0 is out of bounds for axis 0 with size 0
set status to failed
========= main process now complete at 2024-01-13 22:31:46.834807.
========= monitor process now complete at 2024-01-13 22:31:46.840172.
Hi @kpsleung. Are you still experiencing these issues? And just to clarify, you’re getting different error messages each re-run, even if you just re-run the job without changing any parameters or inputs?
Hi @hsnyder. Running the same job with the same inputs and parameters gave different error messages. For now, the problem seems to be worked around by redoing the extraction and removing any particles on the edges of the micrographs.
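In case it helps, the edge check can be done on the exported particle metadata along these lines (just a sketch: it assumes cryosparc-tools is installed and the particles were exported to a .cs file; the filename and the 2% margin are placeholders):

# Sketch: count particles whose box center sits near a micrograph edge,
# using cryoSPARC's fractional location fields.
import numpy as np
from cryosparc.dataset import Dataset  # pip install cryosparc-tools

particles = Dataset.load("particles_exported.cs")  # placeholder filename
x = particles["location/center_x_frac"]
y = particles["location/center_y_frac"]

margin = 0.02  # arbitrary 2% margin from each edge
near_edge = (x < margin) | (x > 1 - margin) | (y < margin) | (y > 1 - margin)
print(f"{int(near_edge.sum())} of {len(particles)} particles lie within {margin:.0%} of an edge")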
Hi folks, I have gotten a similar error with RBMC a couple of times now. Here's all that job.log says. Any thoughts?
**** handle exception rc
set status to failed
Traceback (most recent call last):
File "cryosparc_master/cryosparc_compute/run.py", line 115, in cryosparc_master.cryosparc_compute.run.main
File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_reference_motion.py", line 469, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_reference_motion.run_reference_motion_correction
File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/refmotion.py", line 1259, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.refmotion.mainfn_reconstruct
File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/refmotion.py", line 1275, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.refmotion.mainfn_reconstruct
File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/refmotion.py", line 637, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.refmotion.slice_vol
File "/opt/applications/cryosparc2_worker/cryosparc_compute/particles.py", line 70, in init
self.N_input = int(self[self.blob_key + '/shape'][0,0])
IndexError: index 0 is out of bounds for axis 0 with size 0
srun: error: emnoded38: task 0: Exited with exit code 1
Hi @wtempel, I cloned and re-ran the job, and it failed at a different point with the same error. The first run failed at 1% progress into computing empirical dose weights; the cloned second run failed at 92% progress into motion correcting particles (oof, it almost finished).
Here are the settings I used for the job, in case it’s helpful:
Save results in 16-bit floating point ON
Skip movies with wrong frame count ON
Fourier crop to box size 400 ON (super-res movies)
Parallelized over 4 GPUs
I cloned the second run and tried again with slicing_gpu_is_worker = OFF just to test, and it ended up getting the same error at the same point of job progress as the second run (92% of motion correction).
So now I have:
First run (J163) - failed at 1% into computing empirical dose weights
Second run (J165) - failed at 92% into motion correcting particles
Third run (J170) - failed at 92% into motion correcting particles
Hi @sjcalise, sorry for the delay in responding to this and thanks for providing logs. While I can’t tell with total certainty, my guess would be that you have a specific particle stack that either has corrupt metadata, or was taken from too early in the processing pipeline (i.e. before extraction). Did you connect several distinct particle sets to the RBMC job?
The empirical dose weighting step selects a random subset of movies to operate on, which is why it can hit the same failure at a varying point in its progress, whereas in the motion correction step the failure would normally always occur in the same place.
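To rule out corrupt metadata, a quick sanity check on the connected particle set could look something like this (a sketch only, using cryosparc-tools on an exported .cs file; the filename is a placeholder):

# Sketch: look for degenerate blob metadata of the kind that would break
# slice_vol (the traceback indexes blob/shape at [0, 0]).
import numpy as np
from cryosparc.dataset import Dataset  # pip install cryosparc-tools

particles = Dataset.load("J163_particles_exported.cs")  # placeholder filename
print("total particles:", len(particles))

shapes = particles["blob/shape"]   # per-particle box dimensions
paths = particles["blob/path"]     # extracted stack each particle lives in
print("zero-sized shapes:", int(np.sum(np.any(shapes == 0, axis=1))))
print("empty blob paths:", sum(1 for p in paths if len(p) == 0))
print("distinct box sizes:", np.unique(shapes, axis=0).tolist())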
Hi Harris, no worries about the delay. I never figured this out, but it is possible there was corrupt metadata in a specific particle stack I was using. I didn't use any particles from before extraction, and only connected one particle set to the RBMC job. In the last few weeks I have done some additional classification with this dataset and reduced my particle count down to ~290k, and have successfully run RBMC with these particles.