CryoSPARC Particle Caching Slow

We recently implemented a 400G internal network that connects our job servers directly to our storage servers. However, due to our network topology, the CryoSPARC master is not connected to the 400G internal network; it is instead on the 10G network we previously used for all connections.

Because of this, we are seeing only 150-200 MB/s transfer speeds when we should be seeing closer to 30 GB/s. How would you configure CryoSPARC to transfer the files directly on the job server rather than pass them through the master?

Is there any update on this? I see that it has been set to “open”

Please can you describe in detail the circumstances of the transfers, for example:

  • at which stage of which kind of job the transfers occur
  • where in CryoSPARC the transfer rates are reported
  • workloads under which 30GB/s transfer speeds have been confirmed
  • the significance of the “Master /etc/hosts” file in this context

At the beginning of each job, CryoSPARC has to transfer the data from the storage server to the CryoSPARC scratch folder on the job server. This is where we see the slow transfer speeds: it averages about 200 MB/s, but if we copy the same data manually (using rsync or even cp), we get between 22 GB/s and 34 GB/s.

The master /etc/hosts may not be significant, but both the job server and the storage server have two network interfaces: one on the storage network and one on the public network. Our master server is only connected to the public network, not the storage network.

We want CryoSPARC to transfer the data from our storage servers to our job servers over the 400G storage network, not over the 10G public network (which is the only network the master server can see).

If it isn’t an /etc/hosts issue, does CryoSPARC simply have a hard limit on the allowed transfer speed?

If by that you mean the copying of particle .mrc stacks to the cache: CryoSPARC uses Python’s shutil.copyfile function, running on the worker node, without any involvement of the master node in the actual copy process.
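
To illustrate, here is a minimal sketch of what that worker-side copy amounts to, assuming hypothetical source and cache paths (this is not the actual CryoSPARC source):

import shutil

def cache_particle_stack(source_path, cache_path):
    # shutil.copyfile reads from source_path and writes to cache_path entirely
    # on the worker node; the master node never touches the bytes being copied.
    shutil.copyfile(source_path, cache_path)

# Hypothetical paths, for illustration only:
cache_particle_stack(
    "/storage/projects/P1/J42/extract/particles_000.mrc",
    "/scratch/cryosparc_cache/particles_000.mrc",
)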

Was this test run on the same server, with the same source and destination devices and the same source data?

Yes, the exact same files, with the same source and the same destination device.

Hi @UCBKurt,

Thanks for reporting.
Can you explain your setup a bit more? I’m impressed that the local scratch device can sustain writes at those speeds!

Can you also ensure you’re timing the cp/rsync transfers correctly?

  1. Ensure the buffer cache is empty before starting each of your transfer tests:
    sudo bash -c 'sync; echo 1 > /proc/sys/vm/drop_caches'
  2. Include sync in your timing:
    time ( cp <command> && sync )
    
  3. (You mentioned you already did this) Ensure your test resembles what the particle caching system does: copy hundreds to thousands of small (50-200MB) files (use htop or iotop to monitor IO performance).
    cd /path/to/cryosparc/cache/device
    ls /path/to/folder/with/particle/files/*.mrc | xargs -L1 -P1 -I{} cp {} .
    

In an upcoming release, we’re adding multi-threaded particle caching. To test whether your system will benefit from this, can you run the same xargs/cp command with multiple threads by modifying the value of the -P argument? Please test with various thread counts and see what works best. FWIW, on our system (HDDs + ZFS + NFS + local NVMe), we were achieving the same speeds in CryoSPARC as the below xargs/cp command with 2 threads.

e.g., 2 threads, ensure to monitor IO performance with iotop or htop:

sudo bash -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
cd /path/to/cryosparc/cache/device
ls /path/to/folder/with/particle/files/*.mrc | xargs -P2 -I{} cp {} .

Note that a good place to find a folder with particle files is the job folder for an Extract From Micrographs job.


Hi Stephan,

The elapsed time drops significantly the more xargs threads you use. The optimum we’ve found is -P12.

Our testing was done with 20,460 .mrc files.

At P1, you get:

real    38m38.037s
user    0m6.449s
sys     15m1.161s

At P8 you get:

real    19m9.012s
user    0m5.440s
sys     14m2.497s

And at P12 you get:

real    16m30.794s
user    0m4.974s
sys     16m39.701s

Overall, we were seeing between 200 MB/s and 900 MB/s at -P1, jumping to between 1.5 GB/s and 3.4 GB/s at -P12.

iostat was used to get xargs performance numbers.

However, in the web UI the rate still hovers between 150 MB/s and 200 MB/s, and caching the same files through CryoSPARC takes 52 minutes.

30 GB/s is the theoretical maximum of our bandwidth. Of course, we see less than that due to our SAS3 drives. We are moving to all-NVMe in the next two months, so we should see improved results then.

Our current infrastructure is:

  • Storage server with SAS3 drives
  • 200G Mellanox ConnectX-7
  • 400G switch
  • Local PCIe 4.0 NVMe for scratch


Hi @UCBKurt,

Thanks for testing that out and reporting your numbers. You should definitely see an increase in caching performance when multi-threaded caching is released in CryoSPARC v4.3.0. You will be able to set the number of threads to use via an environment variable (export CRYOSPARC_CACHE_NUM_THREADS=12 in cryosparc_worker/config.sh). The default number of threads used is 2.
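
For intuition, here is a rough sketch of what multi-threaded caching amounts to: a pool of threads, sized by that environment variable, each copying files with shutil.copyfile. The paths and helper below are hypothetical, and this is not the actual CryoSPARC implementation:

import os
import shutil
from concurrent.futures import ThreadPoolExecutor

# Thread count taken from the environment variable mentioned above; default 2.
num_threads = int(os.environ.get("CRYOSPARC_CACHE_NUM_THREADS", "2"))

def copy_to_cache(src, cache_dir="/scratch/cryosparc_cache"):  # hypothetical cache path
    dst = os.path.join(cache_dir, os.path.basename(src))
    shutil.copyfile(src, dst)
    return dst

# Hypothetical list of particle stacks to cache:
sources = ["/storage/P1/J42/extract/particles_%03d.mrc" % i for i in range(100)]

# Each thread handles one file at a time, so several copies are in flight at once.
with ThreadPoolExecutor(max_workers=num_threads) as pool:
    cached = list(pool.map(copy_to_cache, sources))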


I’d also like to add that in v4.3, along with multi-threaded caching, we’ve made improvements to the wrapper around the function that copies files, which should increase the base rate (i.e., the rate when the number of threads is 1).

That sounds promising! Do you have a general timeline on when 4.3.0 will be released?

A 200 MB/s base rate over 10 Gbps sounds like there is already something not quite optimal there?

I can report back that the changes have made a huge difference. Both the multi-threading and the ‘under the hood’ changes have made cache loading much faster. Thanks!
Curious: what changes were made to the wrapper?


In the original version of this wrapper, every time we tried to copy a single file we were doing the following:

  1. getting the file size of the file being copied using os.path.getsize() (maybe 1 OS call, depending on the implementation?)
  2. getting the dirname of the cache path with os.path.dirname (1 OS call?)
  3. attempting to create the directory if it doesn’t already exist with os.makedirs(path, exist_ok=True) (at least 1 OS call every time, 2 if the path doesn’t exist, which is rare)
  4. copying the actual file using shutil.copyfile()

In the v4.3.0 implementation, we only call shutil.copyfile() for each file, eliminating the three extra OS calls. This might speed up the base copy rate depending on the filesystem (and how quickly it can serve metadata requests).
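
As a rough sketch of the difference described above (hypothetical function names, not the actual CryoSPARC source):

import os
import shutil

def copy_one_file_old(src, dst):
    # Pre-v4.3.0 behaviour: three extra OS-level operations per file.
    size = os.path.getsize(src)          # 1. stat the source (size presumably fed rate accounting)
    dst_dir = os.path.dirname(dst)       # 2. split the destination path
    os.makedirs(dst_dir, exist_ok=True)  # 3. check/create the cache directory every time
    shutil.copyfile(src, dst)            # 4. the actual copy

def copy_one_file_new(src, dst):
    # v4.3.0-style behaviour: only the copy itself, presumably because the
    # cache directory is already known to exist.
    shutil.copyfile(src, dst)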
