Heterogeneous Refinement Error that leads to failure

Hello All,

I got a similar error previously on 2D Classification jobs, but it didn’t lead to failure. Now I get the same error on a heterogeneous refinement job, and it fails. I’m running cryoSPARC on an HPC cluster and suspect that could be behind the issue, but I’m confused as to why this doesn’t happen with every job. Maybe I’m misunderstanding something, because as far as I know the output of each job gets stored to my project folder after processing is done. Any advice or input would be appreciated.

Please note that the job referenced in the error as not being accessible is the same job that failed, and I just faced the same issue in an ab initio job as well.

Traceback (most recent call last):
  File "/projappl/project_000000/usrappl/username/cryoSPARC/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2294, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 134, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 135, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 1080, in cryosparc_master.cryosparc_compute.engine.engine.process.work
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 129, in cryosparc_master.cryosparc_compute.engine.engine.EngineThread.load_image_data_gpu
  File "/projappl/project_000000/usrappl/username/cryoSPARC/cryosparc_worker/cryosparc_compute/particles.py", line 34, in get_original_real_data
    data = self.blob.view()
  File "/projappl/project_000000/usrappl/username/cryoSPARC/cryosparc_worker/cryosparc_compute/blobio/mrc.py", line 145, in view
    return self.get()
  File "/projappl/project_000000/usrappl/username/cryoSPARC/cryosparc_worker/cryosparc_compute/blobio/mrc.py", line 140, in get
    _, data, total_time = prefetch.synchronous_native_read(self.fname, idx_start = self.page, idx_limit = self.page+1)
  File "cryosparc_master/cryosparc_compute/blobio/prefetch.py", line 82, in cryosparc_master.cryosparc_compute.blobio.prefetch.synchronous_native_read
OSError: 

IO request details:
Error ocurred (Invalid argument) at line 680 in mrc_readmic (1) 

The requested frame/particle cannot be accessed. The file may be corrupt, or there may be a mismatch between the file and its associated metadata (i.e. cryosparc .cs file).

filename:    /run/nvme/job_21994067/data/instance_puhti-login14.bullx:39401/links/P1-J27-1718115145/7779f21b39ceafd06d7b9c8063d81cc0bdb3250b.mrc
filetype:    0
header_only: 0
idx_start:   189
idx_limit:   190
eer_upsampfactor: 2
eer_numfractions: 40
num_threads: 6
buffer:      (nil)
buffer_sz:   0
nx, ny, nz:  0 0 0
dtype:       0
total_time:  -1.000000
io_time:     0.000000

2D Classification, ab initio 3D Reconstruction and Heterogeneous Refinement jobs may be run with or without particle caching. So I wonder whether:

  1. Affected jobs fail near the beginning, or after some processing has occurred. Could you please post the 10 or so lines that precede the Traceback?
  2. A clone of a job that failed with particle caching enabled would also fail when caching is disabled.
  3. At the time the error occurs, the referenced file exists. For example (on the relevant worker computer), run the command:
    stat /run/nvme/job_21994067/data/instance_puhti-login14.bullx\:39401/links/P1-J27-1718115145/7779f21b39ceafd06d7b9c8063d81cc0bdb3250b.mrc
    
  4. All affected jobs fail at
      File "cryosparc_master/cryosparc_compute/blobio/prefetch.py", line 82, in cryosparc_master.cryosparc_compute.blobio.prefetch.synchronous_native_read
    
  5. /run/nvme/job_21994067/data/ is a local or network filesystem. What is the output of the command:
    stat -f /run/nvme/job_21994067/data/
    
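If `stat` is unavailable on the worker, a small Python sketch can answer the local-vs-network question another way. This is just a sketch assuming a Linux node with `/proc/mounts`; it matches the longest mount point that contains the path and reports its filesystem type:

```python
import os

def fs_type(path):
    """Return the filesystem type for `path` by matching the longest
    mount point from /proc/mounts (Linux only)."""
    real = os.path.realpath(path)
    best_mnt, best_type = "", "unknown"
    with open("/proc/mounts") as mounts:
        for line in mounts:
            fields = line.split()
            mnt, fstype = fields[1], fields[2]
            # A mount point "contains" the path if it is "/" or a prefix
            # ending at a path-component boundary.
            if mnt == "/" or real == mnt or real.startswith(mnt.rstrip("/") + "/"):
                if len(mnt) > len(best_mnt):
                    best_mnt, best_type = mnt, fstype
    return best_type

# "lustre" or "nfs" here would indicate a network filesystem,
# "ext4"/"xfs" a local disk, "tmpfs" RAM-backed scratch.
print(fs_type("/run/nvme/job_21994067/data/"))
```

Note that this only helps while the job's scratch directory still exists; on clusters that clean `/run/nvme` after the batch job ends, it would have to run from inside the job itself.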

Hello, thank you for your reply. I tried running an NU-refinement job, and it met with the same fate.

[CPU: 4.21 GB]
-- Effective number of classes per image: min 1.00 | 25-pct 1.52 | median 2.40 | 75-pct 3.30 | max 4.98

[CPU: 4.21 GB]
-- Class 0: 31.04%

[CPU: 4.21 GB]
-- Class 1: 23.43%

[CPU: 4.21 GB]
-- Class 2: 16.41%

[CPU: 4.21 GB]
-- Class 3: 14.34%

[CPU: 4.21 GB]
-- Class 4: 14.78%

[CPU: 4.09 GB]
Learning rate 0.200

[CPU: 4.06 GB]
Done iteration 1 in 20.433s. Total time so far 95.656s

[CPU: 4.06 GB]
-- Iteration 2

[CPU: 4.06 GB]
Batch size 5000

[CPU: 4.06 GB]
Using Alignment Radius 21.688 (17.042A)

[CPU: 4.06 GB]
Using Reconstruction Radius 33.000 (11.200A)

[CPU: 4.06 GB]
Randomizing assignments for identical classes...

[CPU: 4.06 GB]
Number of BnB iterations 3

[CPU: 4.06 GB]
DEV 0 THR 0 NUM 1500 TOTAL 20.187569 ELAPSED 8.4320249 --

[CPU: 4.10 GB]
Processed 2500.000 images with 5 models in 9.699s.

[CPU: 4.10 GB]
Engine Started.

  1. I will try it and post the result.
    EDIT: My job is still in the queue, but I just noticed there were other jobs with the caching option turned on that didn’t face this error, namely a couple of ab initio jobs and 2D Classification jobs. The only differences between the jobs that failed and the ones that didn’t were a couple of parameters.
    • For the ab initio job: I wanted to try 0.5 class similarity to see how that would change my volumes; that one failed, in contrast to the default 0.1 runs.
    • For the 2D Classification jobs, the parameters that differed from the jobs that worked were:
      1. Circular mask diameter: 240 for the one that was OK, 200 and 210 for the ones that failed.
      2. Uncertainty factor: 3.5 and 4 for the ones that failed.
      3. Batch size: 150 for one of the failed jobs, and 200 for the other failed job as well as the one that worked.
    There were other jobs that I deleted after they failed at the beginning, because I thought it was a one-off thing and the workspace was getting cluttered.

  3. I need to contact the helpdesk of our HPC center to ask how to check that, because that command doesn’t work for me and I’m not sure how to check it myself. I will post the result when I figure out what to do.
    EDIT: I cannot access the NVMe anymore since the job has ended, so I can’t check it.

  4. I checked all the failed jobs; yes, they all fail at that same line.

  5. ‘No such file or command’. From my understanding, the NVMe is a temporary storage location on the compute nodes that is used during the run, “local fast storage”, which is 3600 GiB. So it’s a network filesystem, I think. These directories are cleaned once the batch job finishes, so I think that’s why I can’t check whether the referenced file was there, but I’m not too sure, so I’ll have to check with the helpdesk on that.

When you contact the help desk, please ask them about the filesystem type and whether the filesystem is shared between multiple computers.

Hello,

  1. I’ll paste the helpdesk’s response here:

“NVME is a local SSD disk system
and
/scratch/project_########/ is shared LUSTRE disk between multiple computers”

They also recommended that I add this line to the worker config.sh file:

export CRYOSPARC_SSD_PATH="$LOCAL_SCRATCH"

I will try adding it and see if this resolves the issue, first with the caching option turned on, then off.
EDIT: Turns out this line is already in the batch script, so it doesn’t seem to be the problem.
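A quick way to confirm what a job actually sees at runtime (a sketch; the variable names are the ones from the config line above, everything else is illustrative) is to print the relevant environment variables from inside the job:

```python
import os

def cache_env_report(env=None):
    """Report the cache-related environment variables a cryoSPARC worker
    job would see. Variable names are from the worker config above."""
    env = os.environ if env is None else env
    ssd = env.get("CRYOSPARC_SSD_PATH")
    return {
        "CRYOSPARC_SSD_PATH": ssd,
        "LOCAL_SCRATCH": env.get("LOCAL_SCRATCH"),
        "ssd_path_is_dir": bool(ssd) and os.path.isdir(ssd),
    }

print(cache_env_report())
```

Run from a batch job (not a login node), since `LOCAL_SCRATCH` is typically only set inside the job's environment.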

  2. The job I ran with caching turned off also failed, and it returned the same error but with a different path:

[CPU:   4.19 GB]
  -- Class  2:  12.27%

[CPU:   4.19 GB]
  -- Class  3:  17.09%

[CPU:   4.19 GB]
  -- Class  4:  15.60%

[CPU:   4.07 GB]
  Learning rate 0.090

[CPU:   4.00 GB]
Done iteration 8 in 154.354s. Total time so far 1309.187s

[CPU:   4.00 GB]
-- Iteration 9

[CPU:   4.00 GB]
  Batch size 5000 

[CPU:   4.00 GB]
  Using Alignment Radius 21.489 (17.199A)

[CPU:   4.00 GB]
  Using Reconstruction Radius 32.000 (11.550A)

[CPU:   4.00 GB]
  Number of BnB iterations 3

[CPU:   4.00 GB]
   DEV 0 THR 0 NUM 1500 TOTAL 164.82718 ELAPSED 77.088912 --

[CPU:   4.08 GB]
   Processed 2500.000 images with 5 models in 78.332s.

[CPU:   4.08 GB]
  Engine Started.

[CPU:   4.91 GB]
Traceback (most recent call last):
  File "/projappl/project_########/usrappl/username/cryoSPARC/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2294, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 134, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 135, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 1080, in cryosparc_master.cryosparc_compute.engine.engine.process.work
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 129, in cryosparc_master.cryosparc_compute.engine.engine.EngineThread.load_image_data_gpu
  File "/projappl/project_########/usrappl/username/cryoSPARC/cryosparc_worker/cryosparc_compute/particles.py", line 34, in get_original_real_data
    data = self.blob.view()
  File "/projappl/project_########/usrappl/username/cryoSPARC/cryosparc_worker/cryosparc_compute/blobio/mrc.py", line 145, in view
    return self.get()
  File "/projappl/project_########/usrappl/username/cryoSPARC/cryosparc_worker/cryosparc_compute/blobio/mrc.py", line 140, in get
    _, data, total_time = prefetch.synchronous_native_read(self.fname, idx_start = self.page, idx_limit = self.page+1)
  File "cryosparc_master/cryosparc_compute/blobio/prefetch.py", line 82, in cryosparc_master.cryosparc_compute.blobio.prefetch.synchronous_native_read
OSError: 

IO request details:
Error ocurred (Invalid argument) at line 680 in mrc_readmic (1) 

The requested frame/particle cannot be accessed. The file may be corrupt, or there may be a mismatch between the file and its associated metadata (i.e. cryosparc .cs file).

filename:    /scratch/project_########/VXY/VXY-Processing/CS-cryolive-processing/S1/extract/blob/448/FoilHole_27286497_Data_27280090_20_20240328_022659_fractions_patch_aligned_doseweighted_particles.mrc
filetype:    0
header_only: 0
idx_start:   185
idx_limit:   186
eer_upsampfactor: 2
eer_numfractions: 40
num_threads: 6
buffer:      (nil)
buffer_sz:   0
nx, ny, nz:  0 0 0
dtype:       0
total_time:  -1.000000
io_time:     0.000000

This time, though, the file is in a directory that I can access, so I checked whether it is present there, and it is.

Interesting. Could you please post the output of the command:

ls -l /scratch/project_########/VXY/VXY-Processing/CS-cryolive-processing/S1/extract/blob/448/ | grep -B 10 -A 10 FoilHole_27286497_Data_27280090_20_20240328_022659_fractions_patch_aligned_doseweighted_particles.mrc

Thank you for all your time. This is the output:

[username@puhti-login12 ~]$ ls -l /scratch/project_########/VXY/VXY-Processing/CS-cryolive-processing/S1/extract/blob/448/ | grep -B 10 -A 10 FoilHole_27286497_Data_27280090_20_20240328_022659_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_########  5620736 Jun 10 23:15 FoilHole_27286496_Data_27280079_21_20240328_022640_fractions_patch_aligned_doseweighted_particles_6c3d1d116fc94b51995ff4dd0e4b563d.mrc
-rw-rw-r-- 1 username project_######## 201507840 Jun 10 23:13 FoilHole_27286496_Data_27280079_21_20240328_022640_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_########    803840 Jun 10 23:15 FoilHole_27286496_Data_27280090_21_20240328_022644_fractions_patch_aligned_doseweighted_particles_ec0334f26f5645a9a37d571cb3f2efc8.mrc
-rw-rw-r-- 1 username project_######## 208733184 Jun 10 23:13 FoilHole_27286496_Data_27280090_21_20240328_022644_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_######## 212747264 Jun 10 23:13 FoilHole_27286496_Data_27280105_21_20240328_022637_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_########  62620672 Jun 10 23:15 FoilHole_27286497_Data_27276041_20_20240328_022648_fractions_patch_aligned_doseweighted_particles_62006f57cc8c4bb280cf2c3bfe948736.mrc
-rw-rw-r-- 1 username project_######## 142902272 Jun 10 23:13 FoilHole_27286497_Data_27276041_20_20240328_022648_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_########   8029184 Jun 10 23:15 FoilHole_27286497_Data_27280079_20_20240328_022655_fractions_patch_aligned_doseweighted_particles_c4736c39775c43a48fe2b16748cb43a5.mrc
-rw-rw-r-- 1 username project_######## 199902208 Jun 10 23:13 FoilHole_27286497_Data_27280079_20_20240328_022655_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_########  77071360 Jun 10 23:15 FoilHole_27286497_Data_27280090_20_20240328_022659_fractions_patch_aligned_doseweighted_particles_8aefa52dc4424240833e5c12cd5f6167.mrc
-rw-rw-r-- 1 username project_######## 126845952 Jun 10 23:13 FoilHole_27286497_Data_27280090_20_20240328_022659_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_########   2409472 Jun 10 23:15 FoilHole_27286497_Data_27280105_20_20240328_022652_fractions_patch_aligned_doseweighted_particles_0975fda39ce043b9a184c73fc404caa1.mrc
-rw-rw-r-- 1 username project_######## 203113472 Jun 10 23:13 FoilHole_27286497_Data_27280105_20_20240328_022652_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_########  36127744 Jun 10 23:15 FoilHole_27286498_Data_27276041_9_20240328_022703_fractions_patch_aligned_doseweighted_particles_0b4732a3e74c4d6e91568e4bd321a04d.mrc
-rw-rw-r-- 1 username project_######## 173409280 Jun 10 23:13 FoilHole_27286498_Data_27276041_9_20240328_022703_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_########  7226368 Jun 10 23:15 FoilHole_27286498_Data_27280079_9_20240328_022711_fractions_patch_aligned_doseweighted_particles_11e0613d7c634460b800ff19d85bef3f.mrc
-rw-rw-r-- 1 username project_######## 202310656 Jun 10 23:13 FoilHole_27286498_Data_27280079_9_20240328_022711_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_########    803840 Jun 10 23:15 FoilHole_27286498_Data_27280090_9_20240328_022714_fractions_patch_aligned_doseweighted_particles_47e8fd502dc44ca688d8e1e31487a2ab.mrc
-rw-rw-r-- 1 username project_######## 207127552 Jun 10 23:13 FoilHole_27286498_Data_27280090_9_20240328_022714_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_########    803840 Jun 10 23:15 FoilHole_27286498_Data_27280105_9_20240328_022707_fractions_patch_aligned_doseweighted_particles_e098964e01a540b280fd8b3f4a6fb01a.mrc
-rw-rw-r-- 1 username project_######## 217564160 Jun 10 23:13 FoilHole_27286498_Data_27280105_9_20240328_022707_fractions_patch_aligned_doseweighted_particles.mrc

The following line was highlighted in red in the output:

-rw-rw-r-- 1 username project_######## 126845952 Jun 10 23:13 FoilHole_27286497_Data_27280090_20_20240328_022659_fractions_patch_aligned_doseweighted_particles.mrc

What do you find when you check the particle stacks for corruption with the Check for NaN values option enabled?
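As a complementary check, a mismatch between a particle stack and its metadata can sometimes be spotted from the MRC header alone. Below is a minimal sketch (assuming a standard 1024-byte MRC header whose first four little-endian int32 fields are nx, ny, nz and the data mode, and ignoring extended headers) that reports whether the section index cryoSPARC tried to read actually exists, and whether the file size is consistent with the header:

```python
import os
import struct

# Bytes per value for common MRC data modes (mode 2 = 32-bit float).
MODE_BYTES = {0: 1, 1: 2, 2: 4, 6: 2}

def check_mrc_index(fname, idx):
    """Parse the 1024-byte MRC header and report whether 0-based section
    `idx` exists, and whether the file is at least as large as the header
    implies. Extended headers are ignored, so size_ok is approximate."""
    with open(fname, "rb") as f:
        nx, ny, nz, mode = struct.unpack("<4i", f.read(16))
    expected = 1024 + nx * ny * nz * MODE_BYTES.get(mode, 0)
    return {"nz": nz, "idx_ok": idx < nz,
            "size_ok": os.path.getsize(fname) >= expected}
```

In the failing read above, idx_start was 185; if the nz recorded in that file's header (or the particle count implied by its size) is smaller than that, the .cs metadata and the stack are out of sync.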

There was no corruption detected. Here is the output:

No corruption detected.

[CPU:  135.2 MB]
--------------------------------------------------------------

[CPU:  135.2 MB]
Compiling job outputs...

[CPU:  135.2 MB]
Passing through outputs for output group particles from input group particles

[CPU:  204.6 MB]
This job outputted results ['blob']

[CPU:  204.6 MB]
  Loaded output dset with 1161345 items

[CPU:  204.6 MB]
Passthrough results ['alignments2D', 'ctf', 'location', 'pick_stats']

[CPU:   2.71 GB]
  Loaded passthrough dset with 1161345 items

[CPU:   2.72 GB]
  Intersection of output and passthrough has 1161345 items

[CPU:   2.72 GB]
  Output dataset contains:  ['alignments2D', 'ctf', 'location', 'pick_stats']

[CPU:   2.72 GB]
  Outputting passthrough result alignments2D

[CPU:   2.72 GB]
  Outputting passthrough result ctf

[CPU:   2.72 GB]
  Outputting passthrough result location

[CPU:   2.72 GB]
  Outputting passthrough result pick_stats

[CPU:   2.40 GB]
Checking outputs for output group particles

[CPU:   2.66 GB]
Updating job size...

[CPU:   2.66 GB]
Exporting job and creating csg files...

[CPU:   2.66 GB]
***************************************************************

[CPU:   2.66 GB]
Job complete. Total time 7607.87s