Size of particle set miscalculated

Hi!

Very strange situation here! I am using v4.2.1 with the latest patch, 230427.

I have a large particle set which is 1.5TB in size and my cache is a 1TB NVMe drive, so for the time being I have been working without caching the particle set to the SSD.

Now I have segregated the initial particle set into 3 classes with an ab-initio reconstruction job, and I wanted to run individual refinements on each of them. I assumed that the sub-particle sets would fit on the SSD, so I turned on SSD caching for those jobs. The jobs fail right after the particle set size is calculated, and the strangest part is that every job reports a (sub-)particle set size that is very similar to the size of the original particle set. How is this possible? (The same happens if I try to run 2D classification on those particle stacks.)

2D classification job with a particle set of 577868 particles:

Using random seed of 1709122616
Loading a ParticleStack with 577868 items

SSD cache : cache successfully synced in_use
SSD cache : cache successfully synced, found 0.00MB of files on SSD.
SSD cache : cache successfully requested to check 16291 files.
SSD cache : cache requires 1474728.09MB more on the SSD for files to be downloaded.
Traceback (most recent call last):
File "cryosparc_master/cryosparc_compute/run.py", line 96, in cryosparc_compute.run.main
File "cryosparc_master/cryosparc_compute/jobs/class2D/run.py", line 63, in cryosparc_compute.jobs.class2D.run.run_class_2D
File "/home/angr5008/Software/cryosparc/cryosparc_worker/cryosparc_compute/particles.py", line 114, in read_blobs
u_blob_paths = cache.download_and_return_cache_paths(u_rel_paths)
File "/home/angr5008/Software/cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache.py", line 115, in download_and_return_cache_paths
delete_cache_files(instance_id, worker_hostname, ssd_cache_path, cache_reserve_mb, cache_quota_mb, used_mb, need_mb)
File "/home/angr5008/Software/cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache.py", line 317, in delete_cache_files
assert need_mb <= total_mb, (
AssertionError: SSD cache needs 1474728MB but drive can only be filled up to 927797MB; please disable SSD cache for this job.

2D classification job with a subset of the initial particle set of 200003 particles:

Using random seed of 248115048
Loading a ParticleStack with 200003 items

SSD cache : cache successfully synced in_use
SSD cache : cache successfully synced, found 0.00MB of files on SSD.
SSD cache : cache successfully requested to check 16224 files.
SSD cache : cache requires 1473122.26MB more on the SSD for files to be downloaded.
Traceback (most recent call last):
File "cryosparc_master/cryosparc_compute/run.py", line 96, in cryosparc_compute.run.main
File "cryosparc_master/cryosparc_compute/jobs/class2D/run.py", line 63, in cryosparc_compute.jobs.class2D.run.run_class_2D
File "/home/angr5008/Software/cryosparc/cryosparc_worker/cryosparc_compute/particles.py", line 114, in read_blobs
u_blob_paths = cache.download_and_return_cache_paths(u_rel_paths)
File "/home/angr5008/Software/cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache.py", line 115, in download_and_return_cache_paths
delete_cache_files(instance_id, worker_hostname, ssd_cache_path, cache_reserve_mb, cache_quota_mb, used_mb, need_mb)
File "/home/angr5008/Software/cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache.py", line 317, in delete_cache_files
assert need_mb <= total_mb, (
AssertionError: SSD cache needs 1473122MB but drive can only be filled up to 927797MB; please disable SSD cache for this job.

The jobs request virtually the same amount of cache, even though the second job has less than half the particles of the first.
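Just to quantify this, here is a quick back-of-the-envelope check using only the numbers printed in the two logs above (a rough sketch, not output from the jobs themselves):

```python
# Per-file cost implied by the two logs above: both jobs reference almost
# the same number of source files, and the cost per file is nearly identical.
full_set = {"files": 16291, "need_mb": 1474728.09}   # 577,868 particles
subset   = {"files": 16224, "need_mb": 1473122.26}   # 200,003 particles

for name, job in [("full set", full_set), ("subset", subset)]:
    print(f"{name}: {job['need_mb'] / job['files']:.1f} MB per file to cache")
# full set: 90.5 MB per file to cache
# subset: 90.8 MB per file to cache
```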

I would urgently appreciate any help or feedback on this matter.

Thank you,
André

Hi @AndreGraca,

Thanks for the post. You’re observing this behaviour because of the way particles are organized into files. During Extract from Micrographs, all of the particles from one micrograph are written into one file, and these are the files that are cached in the downstream ab-initio, 2D classification, refinement, etc. jobs. After classifying particles into 3 classes via ab-initio, it is likely that each of the 3 classes contains a nonzero number of particles from each micrograph, so each particle subset will still need to cache most of the files from the original extraction.
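Put differently, the cache requirement is driven by the number of unique source files the particle subset references, not by the particle count. Here is a minimal sketch of that idea (illustrative only, not the cryoSPARC cache code; the per-particle source path corresponds to the blob/path column of the particle .cs file):

```python
import os

def cache_need_mb(particle_source_paths):
    """Sum the on-disk size of the unique files a particle subset points to."""
    unique_files = set(particle_source_paths)          # one file per micrograph
    total_bytes = sum(os.path.getsize(p) for p in unique_files)
    return total_bytes / 1e6

# A 1/3 subset of the particles usually still touches almost every micrograph
# file, so the result barely shrinks until the particles are restacked into
# new files that contain only the subset.
```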

The Restack Particles job was created to solve this problem; it takes in a subset of particles and re-writes them into new MRC files that contain only the inputted particles. If you run a separate Restack Particles job on each of your 3 subsets, and then run 2D classification on each restacked output independently, you should see that the amount of space requested during caching is proportional to the particle subset size.
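Conceptually, the restack step does something like the sketch below (a simplified illustration using the mrcfile package, not the actual job implementation; restack and selection are made-up names):

```python
import numpy as np
import mrcfile

def restack(selection, out_path):
    """selection: list of (src_mrc_path, index_within_file) for the kept particles."""
    images = []
    for src_path, idx in selection:
        with mrcfile.open(src_path, permissive=True) as mrc:
            stack = mrc.data
            if stack.ndim == 2:                 # file holding a single particle
                stack = stack[np.newaxis, ...]
            images.append(np.array(stack[idx]))
    # Write the kept particles into one new, contiguous stack; downstream jobs
    # then only cache this file rather than every per-micrograph source stack.
    with mrcfile.new(out_path, overwrite=True) as out:
        out.set_data(np.stack(images).astype(np.float32))
```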

Best,
Michael

Hi @mmclean!

Thank you for the help.
Right! I had read about the restacking function when it was launched, but it had slipped my mind.
I realised that the restacking job fails with

====== Job process terminated abnormally.

when I run it on the whole particle set (200,000 particles). However, if I use ‘Particle set tools’ to split the particle set into 4 equal sets, the restacking jobs run fine. Is this a known limitation, or is it a problem to be fixed?

I’ve had a similar problem with this job and I noticed it was actually the master node running out of memory. Increasing the size of the swap disk solved it. Might be worth checking?


I am told that the job’s current implementation loads “active” batches of particles into RAM.
One may be able to avoid OOM errors by reducing the values of the Num threads and Particle batch size parameters of the Restack Particles job.
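For what it’s worth, a rough back-of-the-envelope estimate shows why those two parameters matter (the box size, batch size, thread count, and the assumption that each thread holds its own batch are illustrative guesses, not job defaults):

```python
box = 440              # particle box size in pixels (assumption)
batch_size = 10000     # "Particle batch size" parameter (assumption)
num_threads = 4        # "Num threads" parameter (assumption)

bytes_per_particle = box * box * 4          # one float32 image
batch_gb = batch_size * bytes_per_particle / 1e9
print(f"~{batch_gb:.1f} GB per batch, ~{batch_gb * num_threads:.1f} GB if every thread holds a batch")
# ~7.7 GB per batch, ~31.0 GB if every thread holds a batch
```

Under those assumptions, halving either parameter roughly halves the peak RAM, which is why reducing them can avoid the OOM.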
