Heterogeneous Refinement Error that leads to failure

Hello All,

I got a similar error previously on 2D Classification jobs, but it didn’t lead to failure. Now I get the same error on a heterogeneous refinement job, and it fails. I’m running cryoSPARC on an HPC cluster and suspect that could be behind the issue, but I’m confused as to why this doesn’t happen with every job. Maybe I’m misunderstanding something, because as far as I know the output of each job gets stored to my project folder after processing is done. Any advice or input would be appreciated.

Please note that the job referenced in the error as not being accessible is the same job that failed, and I just faced the same issue in an ab initio job as well.

Traceback (most recent call last):
  File "/projappl/project_000000/usrappl/username/cryoSPARC/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2294, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 134, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 135, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 1080, in cryosparc_master.cryosparc_compute.engine.engine.process.work
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 129, in cryosparc_master.cryosparc_compute.engine.engine.EngineThread.load_image_data_gpu
  File "/projappl/project_000000/usrappl/username/cryoSPARC/cryosparc_worker/cryosparc_compute/particles.py", line 34, in get_original_real_data
    data = self.blob.view()
  File "/projappl/project_000000/usrappl/username/cryoSPARC/cryosparc_worker/cryosparc_compute/blobio/mrc.py", line 145, in view
    return self.get()
  File "/projappl/project_000000/usrappl/username/cryoSPARC/cryosparc_worker/cryosparc_compute/blobio/mrc.py", line 140, in get
    _, data, total_time = prefetch.synchronous_native_read(self.fname, idx_start = self.page, idx_limit = self.page+1)
  File "cryosparc_master/cryosparc_compute/blobio/prefetch.py", line 82, in cryosparc_master.cryosparc_compute.blobio.prefetch.synchronous_native_read
OSError: 

IO request details:
Error ocurred (Invalid argument) at line 680 in mrc_readmic (1) 

The requested frame/particle cannot be accessed. The file may be corrupt, or there may be a mismatch between the file and its associated metadata (i.e. cryosparc .cs file).

filename:    /run/nvme/job_21994067/data/instance_puhti-login14.bullx:39401/links/P1-J27-1718115145/7779f21b39ceafd06d7b9c8063d81cc0bdb3250b.mrc
filetype:    0
header_only: 0
idx_start:   189
idx_limit:   190
eer_upsampfactor: 2
eer_numfractions: 40
num_threads: 6
buffer:      (nil)
buffer_sz:   0
nx, ny, nz:  0 0 0
dtype:       0
total_time:  -1.000000
io_time:     0.000000

2D Classification, ab initio 3D Reconstruction and Heterogeneous Refinement jobs may be run with or without particle caching. So I wonder whether:

  1. Affected jobs fail near the beginning, or after some processing has occurred. Could you please post the 10 or so lines that precede the Traceback?
  2. A clone of a job that failed with particle caching enabled would also fail when caching is disabled.
  3. At the time the error occurs, the referenced file exists. For example (on the relevant worker computer), run the command:
    stat /run/nvme/job_21994067/data/instance_puhti-login14.bullx\:39401/links/P1-J27-1718115145/7779f21b39ceafd06d7b9c8063d81cc0bdb3250b.mrc
    
  4. All affected jobs fail at
      File "cryosparc_master/cryosparc_compute/blobio/prefetch.py", line 82, in cryosparc_master.cryosparc_compute.blobio.prefetch.synchronous_native_read
    
  5. /run/nvme/job_21994067/data/ is a local or network filesystem. What is the output of the command:
    stat -f /run/nvme/job_21994067/data/
    
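If `stat` is unavailable on the worker, a small Python sketch can answer the local-vs-network question another way. This is just a sketch assuming a Linux node with `/proc/mounts`; it matches the longest mount point that contains the path and reports its filesystem type:

```python
import os

def fs_type(path):
    """Return the filesystem type for `path` by matching the longest
    mount point from /proc/mounts (Linux only)."""
    real = os.path.realpath(path)
    best_mnt, best_type = "", "unknown"
    with open("/proc/mounts") as mounts:
        for line in mounts:
            fields = line.split()
            mnt, fstype = fields[1], fields[2]
            # A mount point "contains" the path if it is "/" or a prefix
            # ending at a path-component boundary.
            if mnt == "/" or real == mnt or real.startswith(mnt.rstrip("/") + "/"):
                if len(mnt) > len(best_mnt):
                    best_mnt, best_type = mnt, fstype
    return best_type

# "lustre" or "nfs" here would indicate a network filesystem,
# "ext4"/"xfs" a local disk, "tmpfs" RAM-backed scratch.
print(fs_type("/run/nvme/job_21994067/data/"))
```

Note that this only helps while the job's scratch directory still exists; on clusters that clean `/run/nvme` after the batch job ends, it would have to run from inside the job itself.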

Hello, thank you for your reply. I tried running an NU-refinement job, and it met with the same fate.

[CPU: 4.21 GB]
-- Effective number of classes per image: min 1.00 | 25-pct 1.52 | median 2.40 | 75-pct 3.30 | max 4.98

[CPU: 4.21 GB]
-- Class 0: 31.04%

[CPU: 4.21 GB]
-- Class 1: 23.43%

[CPU: 4.21 GB]
-- Class 2: 16.41%

[CPU: 4.21 GB]
-- Class 3: 14.34%

[CPU: 4.21 GB]
-- Class 4: 14.78%

[CPU: 4.09 GB]
Learning rate 0.200

[CPU: 4.06 GB]
Done iteration 1 in 20.433s. Total time so far 95.656s

[CPU: 4.06 GB]
-- Iteration 2

[CPU: 4.06 GB]
Batch size 5000

[CPU: 4.06 GB]
Using Alignment Radius 21.688 (17.042A)

[CPU: 4.06 GB]
Using Reconstruction Radius 33.000 (11.200A)

[CPU: 4.06 GB]
Randomizing assignments for identical classes...

[CPU: 4.06 GB]
Number of BnB iterations 3

[CPU: 4.06 GB]
DEV 0 THR 0 NUM 1500 TOTAL 20.187569 ELAPSED 8.4320249 --

[CPU: 4.10 GB]
Processed 2500.000 images with 5 models in 9.699s.

[CPU: 4.10 GB]
Engine Started.

  1. I will try it and post the result.
    EDIT: My job is still in the queue, but I just noticed there were other jobs with the caching option turned on that didn’t face this error, namely a couple of ab initio jobs and 2D Classification jobs. The only differences between the jobs that failed and the ones that didn’t were a couple of parameters.
    • For the ab initio job: I wanted to try 0.5 class similarity to see how that would change my volumes; that one failed, in contrast to the default 0.1 runs.
    • For the 2D Classification jobs, the parameters that differed from the jobs that worked were:
      1. Circular mask diameter: 240 for the one that was OK, 200 and 210 for the ones that failed.
      2. Uncertainty factor: 3.5 and 4 for the ones that failed.
      3. Batch size: 150 for one of the failed jobs, and 200 for the other failed job as well as the one that worked.
    There were other jobs that I deleted after they failed at the beginning, because I thought it was a one-off thing and the workspace was getting cluttered.

  3. I need to contact the helpdesk of our HPC center to ask how to check that, because that command doesn’t work for me and I’m not sure how to check it myself. I will post the result when I figure out what to do.
    EDIT: I cannot access the NVMe anymore since the job has ended, so I can’t check it.

  4. I checked all the failed jobs; yes, they all fail at that same line.

  5. ‘No such file or command’. From my understanding, the NVMe is a temporary storage location on the compute nodes that is used during the run, “local fast storage”, which is 3600 GiB. So it’s a network filesystem, I think. These directories are cleaned once the batch job finishes, so I think that’s why I can’t check whether the referenced file was there, but I’m not too sure, so I’ll have to check with the helpdesk on that.

When you contact the help desk, please ask them about the filesystem type and whether the filesystem is shared between multiple computers.

Hello,

  1. I’ll paste the helpdesk’s response here:

“NVME is a local SSD disk system
and
/scratch/project_########/ is shared LUSTRE disk between multiple computers”

They also recommended that I add this line to the worker config.sh file:

export CRYOSPARC_SSD_PATH="$LOCAL_SCRATCH"

I will try adding it and see if this resolves the issue, first with the caching option turned on, then off.
EDIT: Turns out this line is already in the batch script, so it doesn’t seem to be the problem.
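A quick way to confirm what a job actually sees at runtime (a sketch; the variable names are the ones from the config line above, everything else is illustrative) is to print the relevant environment variables from inside the job:

```python
import os

def cache_env_report(env=None):
    """Report the cache-related environment variables a cryoSPARC worker
    job would see. Variable names are from the worker config above."""
    env = os.environ if env is None else env
    ssd = env.get("CRYOSPARC_SSD_PATH")
    return {
        "CRYOSPARC_SSD_PATH": ssd,
        "LOCAL_SCRATCH": env.get("LOCAL_SCRATCH"),
        "ssd_path_is_dir": bool(ssd) and os.path.isdir(ssd),
    }

print(cache_env_report())
```

Run from a batch job (not a login node), since `LOCAL_SCRATCH` is typically only set inside the job's environment.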

  2. The job I ran with caching turned off also failed, and it returned the same error but with a different path:

[CPU:   4.19 GB]
  -- Class  2:  12.27%

[CPU:   4.19 GB]
  -- Class  3:  17.09%

[CPU:   4.19 GB]
  -- Class  4:  15.60%

[CPU:   4.07 GB]
  Learning rate 0.090

[CPU:   4.00 GB]
Done iteration 8 in 154.354s. Total time so far 1309.187s

[CPU:   4.00 GB]
-- Iteration 9

[CPU:   4.00 GB]
  Batch size 5000 

[CPU:   4.00 GB]
  Using Alignment Radius 21.489 (17.199A)

[CPU:   4.00 GB]
  Using Reconstruction Radius 32.000 (11.550A)

[CPU:   4.00 GB]
  Number of BnB iterations 3

[CPU:   4.00 GB]
   DEV 0 THR 0 NUM 1500 TOTAL 164.82718 ELAPSED 77.088912 --

[CPU:   4.08 GB]
   Processed 2500.000 images with 5 models in 78.332s.

[CPU:   4.08 GB]
  Engine Started.

[CPU:   4.91 GB]
Traceback (most recent call last):
  File "/projappl/project_########/usrappl/username/cryoSPARC/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2294, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 134, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 135, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 1080, in cryosparc_master.cryosparc_compute.engine.engine.process.work
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 129, in cryosparc_master.cryosparc_compute.engine.engine.EngineThread.load_image_data_gpu
  File "/projappl/project_########/usrappl/username/cryoSPARC/cryosparc_worker/cryosparc_compute/particles.py", line 34, in get_original_real_data
    data = self.blob.view()
  File "/projappl/project_########/usrappl/username/cryoSPARC/cryosparc_worker/cryosparc_compute/blobio/mrc.py", line 145, in view
    return self.get()
  File "/projappl/project_########/usrappl/username/cryoSPARC/cryosparc_worker/cryosparc_compute/blobio/mrc.py", line 140, in get
    _, data, total_time = prefetch.synchronous_native_read(self.fname, idx_start = self.page, idx_limit = self.page+1)
  File "cryosparc_master/cryosparc_compute/blobio/prefetch.py", line 82, in cryosparc_master.cryosparc_compute.blobio.prefetch.synchronous_native_read
OSError: 

IO request details:
Error ocurred (Invalid argument) at line 680 in mrc_readmic (1) 

The requested frame/particle cannot be accessed. The file may be corrupt, or there may be a mismatch between the file and its associated metadata (i.e. cryosparc .cs file).

filename:    /scratch/project_########/VXY/VXY-Processing/CS-cryolive-processing/S1/extract/blob/448/FoilHole_27286497_Data_27280090_20_20240328_022659_fractions_patch_aligned_doseweighted_particles.mrc
filetype:    0
header_only: 0
idx_start:   185
idx_limit:   186
eer_upsampfactor: 2
eer_numfractions: 40
num_threads: 6
buffer:      (nil)
buffer_sz:   0
nx, ny, nz:  0 0 0
dtype:       0
total_time:  -1.000000
io_time:     0.000000

This time, though, the file is in a directory that I can access, so I checked whether it is present there, and it is.

Interesting. Could you please post the output of the command:

ls -l /scratch/project_########/VXY/VXY-Processing/CS-cryolive-processing/S1/extract/blob/448/ | grep -B 10 -A 10 FoilHole_27286497_Data_27280090_20_20240328_022659_fractions_patch_aligned_doseweighted_particles.mrc

Thank you for all your time. This is the output:

[username@puhti-login12 ~]$ ls -l /scratch/project_########/VXY/VXY-Processing/CS-cryolive-processing/S1/extract/blob/448/ | grep -B 10 -A 10 FoilHole_27286497_Data_27280090_20_20240328_022659_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_########  5620736 Jun 10 23:15 FoilHole_27286496_Data_27280079_21_20240328_022640_fractions_patch_aligned_doseweighted_particles_6c3d1d116fc94b51995ff4dd0e4b563d.mrc
-rw-rw-r-- 1 username project_######## 201507840 Jun 10 23:13 FoilHole_27286496_Data_27280079_21_20240328_022640_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_########    803840 Jun 10 23:15 FoilHole_27286496_Data_27280090_21_20240328_022644_fractions_patch_aligned_doseweighted_particles_ec0334f26f5645a9a37d571cb3f2efc8.mrc
-rw-rw-r-- 1 username project_######## 208733184 Jun 10 23:13 FoilHole_27286496_Data_27280090_21_20240328_022644_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_######## 212747264 Jun 10 23:13 FoilHole_27286496_Data_27280105_21_20240328_022637_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_########  62620672 Jun 10 23:15 FoilHole_27286497_Data_27276041_20_20240328_022648_fractions_patch_aligned_doseweighted_particles_62006f57cc8c4bb280cf2c3bfe948736.mrc
-rw-rw-r-- 1 username project_######## 142902272 Jun 10 23:13 FoilHole_27286497_Data_27276041_20_20240328_022648_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_########   8029184 Jun 10 23:15 FoilHole_27286497_Data_27280079_20_20240328_022655_fractions_patch_aligned_doseweighted_particles_c4736c39775c43a48fe2b16748cb43a5.mrc
-rw-rw-r-- 1 username project_######## 199902208 Jun 10 23:13 FoilHole_27286497_Data_27280079_20_20240328_022655_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_########  77071360 Jun 10 23:15 FoilHole_27286497_Data_27280090_20_20240328_022659_fractions_patch_aligned_doseweighted_particles_8aefa52dc4424240833e5c12cd5f6167.mrc
-rw-rw-r-- 1 username project_######## 126845952 Jun 10 23:13 FoilHole_27286497_Data_27280090_20_20240328_022659_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_########   2409472 Jun 10 23:15 FoilHole_27286497_Data_27280105_20_20240328_022652_fractions_patch_aligned_doseweighted_particles_0975fda39ce043b9a184c73fc404caa1.mrc
-rw-rw-r-- 1 username project_######## 203113472 Jun 10 23:13 FoilHole_27286497_Data_27280105_20_20240328_022652_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_########  36127744 Jun 10 23:15 FoilHole_27286498_Data_27276041_9_20240328_022703_fractions_patch_aligned_doseweighted_particles_0b4732a3e74c4d6e91568e4bd321a04d.mrc
-rw-rw-r-- 1 username project_######## 173409280 Jun 10 23:13 FoilHole_27286498_Data_27276041_9_20240328_022703_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_########  7226368 Jun 10 23:15 FoilHole_27286498_Data_27280079_9_20240328_022711_fractions_patch_aligned_doseweighted_particles_11e0613d7c634460b800ff19d85bef3f.mrc
-rw-rw-r-- 1 username project_######## 202310656 Jun 10 23:13 FoilHole_27286498_Data_27280079_9_20240328_022711_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_########    803840 Jun 10 23:15 FoilHole_27286498_Data_27280090_9_20240328_022714_fractions_patch_aligned_doseweighted_particles_47e8fd502dc44ca688d8e1e31487a2ab.mrc
-rw-rw-r-- 1 username project_######## 207127552 Jun 10 23:13 FoilHole_27286498_Data_27280090_9_20240328_022714_fractions_patch_aligned_doseweighted_particles.mrc
-rw-rw-r-- 1 username project_########    803840 Jun 10 23:15 FoilHole_27286498_Data_27280105_9_20240328_022707_fractions_patch_aligned_doseweighted_particles_e098964e01a540b280fd8b3f4a6fb01a.mrc
-rw-rw-r-- 1 username project_######## 217564160 Jun 10 23:13 FoilHole_27286498_Data_27280105_9_20240328_022707_fractions_patch_aligned_doseweighted_particles.mrc

The following line was highlighted in red in the output:

-rw-rw-r-- 1 username project_######## 126845952 Jun 10 23:13 FoilHole_27286497_Data_27280090_20_20240328_022659_fractions_patch_aligned_doseweighted_particles.mrc

What do you find when you check the particle stacks for corruption with the Check for NaN values option enabled?
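As a complementary check, a mismatch between a particle stack and its metadata can sometimes be spotted from the MRC header alone. Below is a minimal sketch (assuming a standard 1024-byte MRC header whose first four little-endian int32 fields are nx, ny, nz and the data mode, and ignoring extended headers) that reports whether the section index cryoSPARC tried to read actually exists, and whether the file size is consistent with the header:

```python
import os
import struct

# Bytes per value for common MRC data modes (mode 2 = 32-bit float).
MODE_BYTES = {0: 1, 1: 2, 2: 4, 6: 2}

def check_mrc_index(fname, idx):
    """Parse the 1024-byte MRC header and report whether 0-based section
    `idx` exists, and whether the file is at least as large as the header
    implies. Extended headers are ignored, so size_ok is approximate."""
    with open(fname, "rb") as f:
        nx, ny, nz, mode = struct.unpack("<4i", f.read(16))
    expected = 1024 + nx * ny * nz * MODE_BYTES.get(mode, 0)
    return {"nz": nz, "idx_ok": idx < nz,
            "size_ok": os.path.getsize(fname) >= expected}
```

In the failing read above, idx_start was 185; if the nz recorded in that file's header (or the particle count implied by its size) is smaller than that, the .cs metadata and the stack are out of sync.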

There was no corruption detected. Here is the output:

No corruption detected.

[CPU:  135.2 MB]
--------------------------------------------------------------

[CPU:  135.2 MB]
Compiling job outputs...

[CPU:  135.2 MB]
Passing through outputs for output group particles from input group particles

[CPU:  204.6 MB]
This job outputted results ['blob']

[CPU:  204.6 MB]
  Loaded output dset with 1161345 items

[CPU:  204.6 MB]
Passthrough results ['alignments2D', 'ctf', 'location', 'pick_stats']

[CPU:   2.71 GB]
  Loaded passthrough dset with 1161345 items

[CPU:   2.72 GB]
  Intersection of output and passthrough has 1161345 items

[CPU:   2.72 GB]
  Output dataset contains:  ['alignments2D', 'ctf', 'location', 'pick_stats']

[CPU:   2.72 GB]
  Outputting passthrough result alignments2D

[CPU:   2.72 GB]
  Outputting passthrough result ctf

[CPU:   2.72 GB]
  Outputting passthrough result location

[CPU:   2.72 GB]
  Outputting passthrough result pick_stats

[CPU:   2.40 GB]
Checking outputs for output group particles

[CPU:   2.66 GB]
Updating job size...

[CPU:   2.66 GB]
Exporting job and creating csg files...

[CPU:   2.66 GB]
***************************************************************

[CPU:   2.66 GB]
Job complete. Total time 7607.87s