Inefficient data access during 2D Class on Lustre

bsobol · March 14, 2023, 4:55pm

Dear cryoSPARC Team,

We are facing significantly increased processing times in certain job types (mainly 2D Class, but this also applies to other jobs that offer the SSD cache option, which we unfortunately cannot use on this specific computing cluster).

After some investigation, it comes down to the interaction between the Lustre filesystem and cryoSPARC's data access strategy.

We found out that during the 2D classification, cryoSPARC:

Reads 8KB at the beginning of the file - which is stored in Lustre MDS
Then it reads 1KB and 256KB from the middle of the file - stored in OST

The first part is fast - it took about 2 seconds to process 2500 files.
However, reading those 256 KB can take around 13 minutes for the same 2500 files when we reproduce the read with dd.

This pattern is repeated multiple times in every classification iteration.
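
For anyone who wants to check their own filesystem, the pattern above can be approximated with a short script. This is only a rough sketch, not cryoSPARC's actual code: the function name `time_read_pattern` is made up for illustration, and since the exact offsets are not known, the "middle" of the file is assumed to be half its size.

```python
import os
import time

HEADER_BYTES = 8 * 1024    # 8 KB read at the beginning of the file
SMALL_BYTES = 1 * 1024     # 1 KB read from the middle of the file
LARGE_BYTES = 256 * 1024   # 256 KB read from the middle of the file


def time_read_pattern(path):
    """Replay the three reads on one file.

    Returns a dict mapping step name to (bytes_read, elapsed_seconds).
    """
    size = os.path.getsize(path)
    middle = size // 2  # assumption: the real offsets are not known
    steps = [
        ("header", 0, HEADER_BYTES),
        ("small", middle, SMALL_BYTES),
        ("large", middle + SMALL_BYTES, LARGE_BYTES),
    ]
    results = {}
    fd = os.open(path, os.O_RDONLY)
    try:
        for name, offset, length in steps:
            start = time.perf_counter()
            # os.pread reads at an explicit offset without moving the file position
            data = os.pread(fd, length, offset)
            results[name] = (len(data), time.perf_counter() - start)
    finally:
        os.close(fd)
    return results
```

Calling `time_read_pattern` on each particle file and summing the per-step times gives a picture comparable to the dd test. Files that were read recently may be served from the client page cache, so timings are only meaningful on data that has not been touched for a while.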

The result is that 2D Class performance is strongly bottlenecked by these reads: a job that should finish in about 1 hour (based on the official cryoSPARC benchmarks and our experience) can take up to 6 hours when the filesystem is under heavy load.

Have you considered introducing some optimization into this reading scheme (e.g. parallel reads, caching)?
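
To illustrate what we mean by parallel reads, here is a minimal sketch that issues many small reads concurrently from a thread pool, so that several OSTs can be kept busy at once instead of waiting on one read at a time. The names `read_chunk` and `read_chunks_parallel` and the default of 16 workers are our own assumptions for illustration; this is not how cryoSPARC is implemented.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_chunk(path, offset, length):
    """Read `length` bytes from `path` starting at `offset`."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)


def read_chunks_parallel(requests, max_workers=16):
    """Run many reads concurrently.

    `requests` is a list of (path, offset, length) tuples.
    Results are returned in the same order as the requests.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda req: read_chunk(*req), requests))
```

Threads are enough here because the work is I/O-bound: each thread spends its time waiting on the filesystem rather than running Python code. Whether this actually helps depends on how the files are striped and how loaded the OSTs are.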

HDD-based distributed filesystems are still the core of storage systems in many (most?) HPC centers, and any improvement in this area would greatly benefit the efficiency of resource utilization.

For tests, we used cryoSPARC 4.1.2 and the benchmark dataset (https://guide.cryosparc.com/setup-configuration-and-management/cryosparc-on-aws/performance-benchmarks).

August · March 15, 2023, 7:17pm

I am also using a Lustre file system on our HPC cluster and experiencing slow cryoSPARC jobs. I made a post about GPUs being underutilized, which this would explain. Thanks for finding the bottleneck!

sdawood · August 9, 2023, 1:59pm

Thanks for the request and apologies for the delay! We’ve added this to our to-do list.

- Suhail