Poor cryoSPARC performance when accessing files from petabyte disk arrays

Dear cryoSPARC Developers,

I wonder if you could help us with the poor cryoSPARC performance when accessing files from petabyte disk arrays?

We have new servers with NVIDIA RTX6000 cards. cryoSPARC runs fast when the data is kept on local drives and ~20 times slower when it needs to retrieve the files from GPFS or Lustre file systems.

The affected steps are import movies, motion correction, and CTF estimation. Interestingly, if we start the same import job a second time, i.e., when the data sits in the GPFS/Lustre cache, it runs 50-100 times faster than the first time. Our disk array engineers looked into the issue and assured me that a relatively trivial code change in cryoSPARC could fix this.

Given the typical size of current datasets and the proliferation of petabyte disk arrays in cryoEM-related research, this issue could be critical for many cryoSPARC users. I would gladly connect you with our IT department and work closely to resolve this issue for everyone’s benefit.

Thank you,
Sergei

That’s intriguing. Would it be possible for you to ask them to be more specific? I’m very skeptical that anything in the CryoSPARC code could have much of an impact on disk array performance and also that the size of the array is relevant. I’m constantly fighting disk array I/O issues and am looking for solutions myself.

Hi @Sergei,

Thanks for posting! As @pgoetz said, it would be really great to know what the disk array engineers found out - we are keen to fix any issue that can be identified - unfortunately it’s hard for us to test on every type of disk array or file system. If you can provide any more details either here, or via email to feedback@structura.bio that would be great!!

Hi developers,

I would like to raise this issue as well. I am currently testing cryoSPARC on our GPFS and Lustre file systems, and we also find that cryoSPARC runs slower on these parallel file systems than on a local drive, especially in the 2D classification and 3D init steps. More interestingly, these two steps run even slower on GPFS than on Lustre. The GPU does not seem to be the bottleneck for these two steps, and I/O might be the issue (e.g., we see lots of sequential read calls during these two steps in our environment; the SSD cache is not enabled). We are working with the IBM team on this now, and I hope we can also discuss it in detail with you. Thanks!

Hi apunjani,

This is Alvin from the HPC team at the National Institutes of Health; I am working with Sergei on this I/O performance issue. We are seeing the same problem Mike just mentioned above: the most popular parallel file systems are much slower than a local drive, and GPFS is worse than Lustre. We have high-end storage systems connected through a fast InfiniBand network, and the I/O benchmarks looked promising. We have spent 20+ hours over the past two months working with senior engineers from vendor support, and we have a lot of test data to share for this I/O performance tuning. We have powerful GPU nodes (NVIDIA V100, 756GB RAM, 40TB+ SSD, …) and GPFS and Lustre running on one of the best storage systems, so I believe we are not the only cryoSPARC users experiencing this problem. Earlier I sent a similar message about this to feedback@structura.bio and to Saara. I would love to schedule a web meeting with you to discuss the details. I can invite Sergei and the GPFS/Lustre support engineers from the storage side; we have quite a lot of test data and findings to share, and I can run tests on our systems per your instructions (through Webex/Teams). Please let me know what works best for you. Thank you in advance!

In short, what we’ve found is that the way the TIFF files are read is problematic for HDD media and not optimal for the file system. The I/O pattern we observed looks like this:

  • read the first 8 bytes with a POSIX read() call
  • read the TIFF metadata using mmap(); however, the I/O requests, while somewhat clustered together, are not in order. For example (these are 4K LBA offsets within the file): 1230, 1233, 1232, 2464, 2467, 2466. The offsets extend roughly through the end of the file.

This results in a lot of small I/O requests to the storage array. It would be significantly more efficient from the array perspective to read the entire file sequentially even though strictly speaking we don’t need all the data read.
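The two access patterns can be contrasted in a small Python sketch. This is not cryoSPARC's code: the function names and the chunk size are hypothetical and chosen only for illustration, and the page numbers passed in would be the 4K offsets quoted above. The first function reads 8 bytes and then touches individual 4K pages through mmap in whatever order they are given; the second reads the whole file front to back in large chunks.

```python
import mmap

PAGE = 4096  # 4K pages, matching the offsets quoted above


def read_pages_via_mmap(path, page_numbers):
    """Read the first 8 bytes, then touch individual 4K pages via mmap.

    Pages are accessed in the order given, so an out-of-order list
    produces out-of-order page faults against the file.
    """
    with open(path, "rb") as f:
        header = f.read(8)
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pages = [bytes(mm[p * PAGE:(p + 1) * PAGE]) for p in page_numbers]
    return header, pages


def read_whole_file(path, chunk_size=16 * 1024 * 1024):
    """Read the file sequentially in large chunks (chunk size is an assumption)."""
    chunks = []
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty result means end of file
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

Calling read_pages_via_mmap(path, [1230, 1233, 1232, 2464, 2467, 2466]) mimics the observed pattern on a file of at least 2468 pages, while read_whole_file(path) issues only large, in-order reads. Timing both on a cold cache would be one way to check whether the ordering matters on a given array.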

I noticed that the blobio/tiff interface calls the C TIFF API. I also noticed that the TIFFfile Python interface has an attribute “use_memmap”. If this is false, the entire file is read into memory. The TIFFfile class also allows access (I think) to the TIFF directory structure through the IFD attribute on a TIFFfile instance. Perhaps the blobio interface could be modified to use the Python TIFFfile interface, which would read the entire file sequentially and should, in theory, give much higher performance from the disk subsystem.
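To illustrate the idea of reading the file once and then walking the TIFF directory structure in memory, here is a minimal sketch using only the Python standard library. It does not use the TIFFfile class or blobio; the function names are hypothetical. It assumes a classic TIFF layout (8-byte header with magic number 42, followed by a chain of IFDs with 12-byte entries) and does not handle BigTIFF.

```python
import struct


def ifd_offsets(buf):
    """Return the byte offsets of all IFDs in a classic TIFF held in memory."""
    byte_order = {b"II": "<", b"MM": ">"}.get(bytes(buf[:2]))
    if byte_order is None:
        raise ValueError("not a TIFF file")
    # Header: 2-byte order mark, 2-byte magic, 4-byte offset of the first IFD
    magic, offset = struct.unpack_from(byte_order + "HI", buf, 2)
    if magic != 42:
        raise ValueError("only classic TIFF (magic 42) is handled in this sketch")
    offsets, seen = [], set()
    while offset != 0 and offset not in seen:  # 'seen' guards against loops
        seen.add(offset)
        offsets.append(offset)
        (n_entries,) = struct.unpack_from(byte_order + "H", buf, offset)
        # Each entry is 12 bytes; the next-IFD offset follows the entries
        (offset,) = struct.unpack_from(
            byte_order + "I", buf, offset + 2 + 12 * n_entries
        )
    return offsets


def load_tiff(path):
    """Read the whole file in one sequential pass, then parse from memory."""
    with open(path, "rb") as f:
        buf = f.read()
    return buf, ifd_offsets(buf)
```

With this approach the directory walk touches only memory, so the storage array sees a single sequential read regardless of where the IFDs sit in the file.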

Hi @aaronknister this is really interesting insight, we’re investigating it now.

We’ve put together an IO benchmarking script that tests reading movie files into memory. It runs two tests: first it reads each file with basic Python IO, then it reads it with cryoSPARC’s blobio/TIFF/MRC reading facilities. This lets us compare the performance of the two strategies.

Here’s how to run it on your system:

# Download to cryosparc-bench.tar.gz
curl https://structura-assets.s3.amazonaws.com/bench/4298a72c96f888d6fbab85f7715bb053f96c713b087fa0ba746493bfe4cfa1f2/cryosparc-bench-2021-05-12.tar.gz -o cryosparc-bench.tar.gz

# Install
mkdir cryosparc-bench
tar -xzf cryosparc-bench.tar.gz -C cryosparc-bench

# Activate environment
cd cryosparc-bench
source env/bin/activate

# View available flags
PYTHONPATH=$(pwd) LD_LIBRARY_PATH="env/lib" python bin/bench_io.py --help

# Clear file system caches (writing to drop_caches requires root), then do an example run
sync && echo 1 > /proc/sys/vm/drop_caches
PYTHONPATH=$(pwd) LD_LIBRARY_PATH="env/lib" python bin/bench_io.py --threads 4 --limit 10 --randomize '/path/to/files/*.tif'

Replace /path/to/files/*.tif with the glob path to micrograph or particle stack files on your system (keep wrapped in single quotes). Run the last two lines at least three times.

For anyone in this thread experiencing this poor IO performance, please try the above instructions on your system and post the output of the last two lines.
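For readers who want a rough idea of what the plain Python IO half of such a test might look like, here is a minimal sketch. It is not the contents of bench_io.py; the function names are hypothetical, and the parameters only mirror the --threads, --limit and --randomize flags shown above under the assumption that they mean what their names suggest.

```python
import glob
import random
import time
from concurrent.futures import ThreadPoolExecutor


def read_file(path, chunk_size=8 * 1024 * 1024):
    """Read one file sequentially and return the number of bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def bench_plain_read(pattern, threads=4, limit=10, randomize=True):
    """Time plain sequential reads of up to `limit` files matching `pattern`."""
    paths = sorted(glob.glob(pattern))
    if randomize:
        random.shuffle(paths)
    paths = paths[:limit]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        sizes = list(pool.map(read_file, paths))
    elapsed = time.perf_counter() - start
    total_bytes = sum(sizes)
    return {
        "files": len(paths),
        "bytes": total_bytes,
        "seconds": elapsed,
        "MB_per_s": total_bytes / 1e6 / elapsed if elapsed > 0 else float("inf"),
    }
```

For example, bench_plain_read('/path/to/files/*.tif', threads=4, limit=10) returns the file count, bytes read, elapsed seconds and throughput. As with the script above, the numbers are only meaningful if the file system caches were cleared first.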

Hi @nfrasser and all,

Thanks for the discussion and for sharing this information. I will run the benchmark and post the output soon.

The other thing I noticed today: for a 3D init step with 300K particles, GPFS performs better than Lustre for the first 200 iterations, before the annealing starts, and after that GPFS becomes slower than Lustre. The accumulated total time for the first 700 iterations (of 1600 in total), recorded every 100 iterations, in seconds: GPFS [512, 1002, 1565, 2884, 4418, 6138, 7940]; Lustre [892, 1439, 803, 2283, 2612, 2920, 3212].
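Since these are accumulated times, the cost of each block of 100 iterations is easier to see as differences between consecutive values. A short sketch of that arithmetic, using the numbers exactly as posted (the helper name is just for illustration):

```python
gpfs = [512, 1002, 1565, 2884, 4418, 6138, 7940]
lustre = [892, 1439, 803, 2283, 2612, 2920, 3212]


def per_interval(cumulative):
    """Convert accumulated times into time spent in each 100-iteration block."""
    return [b - a for a, b in zip([0] + cumulative[:-1], cumulative)]


print(per_interval(gpfs))    # [512, 490, 563, 1319, 1534, 1720, 1802]
print(per_interval(lustre))  # [892, 547, -636, 1480, 329, 308, 292]
```

On these figures, GPFS takes roughly 500 s per 100 iterations for the first 300 iterations and 1300 to 1800 s afterwards, while Lustre drops to about 300 s per block from iteration 400 onward. The third Lustre value (803) is lower than the one before it, which gives a negative interval, so that entry may need to be double-checked.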