We installed CryoSPARC on an external HPC system that has only a 500 GB local SSD per node for the cache.
Since most of our data is larger than that, we cannot really use it and therefore keep the data on a shared GPFS filesystem.
It turns out that performance drops dramatically when the data sits on GPFS.
As an example from the very beginning of the process, here are parts of the output of the same job, using the same seed, producing the same results, running on the same machine; the only difference is the location of the input data:
SSD (6 seconds)
[Wed, 07 Aug 2024 20:04:32 GMT] [CPU RAM used: 317 MB] Computing consensus reconstruction with 17349 particles...
[Wed, 07 Aug 2024 20:04:32 GMT] [CPU RAM used: 317 MB] THR 0 TOTAL 87.202082 ELAPSED 6.0124003 --
Processed 8524 / 8524 particles
[Wed, 07 Aug 2024 20:04:39 GMT] [CPU RAM used: 527 MB] No solvent mask supplied. Generating solvent mask from consensus structure.
GPFS (1168 seconds)
[Wed, 07 Aug 2024 08:03:47 GMT] [CPU RAM used: 297 MB] Computing consensus reconstruction with 17349 particles...
[Wed, 07 Aug 2024 08:03:47 GMT] [CPU RAM used: 297 MB] THR 0 TOTAL 10999.894 ELAPSED 1168.1690 --
Processed 8524 / 8524 particles
[Wed, 07 Aug 2024 08:23:18 GMT] [CPU RAM used: 519 MB] No solvent mask supplied. Generating solvent mask from consensus structure.
That is roughly a factor-200 speedup using the SSD, which seems like a lot to me.
This was only a small test dataset; it is even worse for larger datasets, even though that comparison is more complicated because of the 500 GB SSD limit.
Could something similar be going on as with the petabyte disk arrays discussed previously, but this time with MRC files?
Any suggestions or ideas on how to optimize the performance are highly appreciated.
Thanks @mstabrin for posting these details.
Did you observe slowdowns with networked versus local caching specifically for 3D Classification jobs, or also for other jobs involving the particle cache (3D ab-initio reconstruction, 3D refinement job types, etc.)?
In the latter case, are 3D Classification jobs more severely affected than other job types that use the particle cache?
We have now also run NU refinements with the small dataset.
Using the SSD cache took about 13 minutes, while using GPFS took about 50 minutes.
So I would think it is a general problem with GPFS and its block size.
Our HPC IT was quite interested in the TIFF mmap case that was mentioned in the other thread.
Our GPFS seems to have a block size of several megabytes, which makes non-linear access to a file highly inefficient. Trouble with mmap on GPFS also seems to be a known issue.
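For reference, checking what block size a mount advertises is straightforward (a minimal sketch using only the Python standard library; the path is a placeholder):

```python
import os

# Placeholder path on the GPFS mount; substitute your own data directory.
st = os.statvfs("/gpfs/project/data")
print(f"preferred I/O block size: {st.f_bsize} bytes")
print(f"fragment (fundamental) block size: {st.f_frsize} bytes")
```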
Could it be that the MRC reader does something similar to the TIFF reader when CRYOSPARC_TIFF_IO_SHM=false?
And would it be possible to have a similar workaround at the cost of memory efficiency?
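To make the question concrete, here is a rough sketch of the two access patterns I have in mind (this is not CryoSPARC's actual reader; paths and offsets are placeholders): mmap-style random access directly against the GPFS file, versus copying the whole file into /dev/shm once and serving the random reads from the in-memory copy.

```python
import mmap
import os
import shutil

SRC = "/gpfs/project/stack.mrc"  # placeholder: file on the GPFS mount
SHM = "/dev/shm/stack.mrc"       # local tmpfs copy (costs RAM equal to the file size)

def read_chunk_mmap(path, offset, nbytes):
    """mmap-style access: every touched page turns into a small read
    against the underlying filesystem, which GPFS handles poorly."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + nbytes]

def read_chunk_shm(path, offset, nbytes):
    """Workaround pattern: one large sequential copy into shared memory,
    after which all random reads hit tmpfs instead of GPFS."""
    if not os.path.exists(SHM):
        shutil.copyfile(path, SHM)  # sequential read, GPFS-friendly
    with open(SHM, "rb") as f:
        f.seek(offset)
        return f.read(nbytes)

# Both return the same bytes; only the I/O pattern against GPFS differs.
assert read_chunk_mmap(SRC, 1024, 4096) == read_chunk_shm(SRC, 1024, 4096)
```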
Unfortunately, IMO, the results you’re seeing are expected. Networked/clustered file systems are just very slow in applications like this (large number of random small reads). If the nodes you’re working on have a very large amount of RAM you could consider using /dev/shm as your “SSD” cache, but I suspect if the node SSDs are only 500GB, then their RAM isn’t significantly larger than that.
The MRC reading code doesn't do anything non-obvious the way the TIFF reading code does (as you seem to know, the TIFF reading code has a hack to alleviate horrible network filesystem performance even on sequential reads). Our use of an SSD cache is our attempt to turn a fundamentally random-access workload into a sequential one: we issue large sequential reads to the network filesystem (which network filesystems tend to handle okay), copy the particles onto a local SSD with dramatically better latency, and then hit the SSD with the small random reads.
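Schematically, the pattern looks something like this (a rough sketch of the idea, not our actual cache code; the paths, particle size, and index list are placeholders):

```python
import os
import random
import shutil

NETWORK_STACK = "/gpfs/project/particles.mrcs"  # placeholder: particle stack on the network filesystem
LOCAL_CACHE   = "/scratch/ssd/particles.mrcs"   # placeholder: per-node SSD (or /dev/shm) cache location
HEADER_BYTES   = 1024                           # standard MRC header size (ignoring extended headers)
PARTICLE_BYTES = 256 * 256 * 4                  # assumed 256x256 float32 particle images

def stage_to_cache(src, dst):
    """One large sequential copy -- the access pattern network filesystems handle best."""
    if not os.path.exists(dst):
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copyfile(src, dst)
    return dst

def read_particle(path, index):
    """One small random read -- cheap on a local SSD, very expensive on GPFS."""
    with open(path, "rb") as f:
        f.seek(HEADER_BYTES + index * PARTICLE_BYTES)
        return f.read(PARTICLE_BYTES)

# Stage once, then serve the random-order particle accesses from the local copy.
cached = stage_to_cache(NETWORK_STACK, LOCAL_CACHE)
for index in random.sample(range(17349), k=100):
    image_bytes = read_particle(cached, index)
```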
I don’t think there is much else we can do about this, given that many of cryosparc’s algorithms require particles to be individually accessible in a random order. Sorry it’s not better news!