We recently implemented a 400G internal network that connects our job servers directly to our storage servers. However, due to our network layout, the CryoSPARC master is not connected to the 400G internal network, but rather to the 10G network we previously used for all connections.
How can we configure CryoSPARC to transfer the files on the job server itself rather than pass them through the master? Because of this setup, we are seeing only 150-200 MB/s transfer speeds when we should be seeing closer to 30 GB/s.
At the beginning of each job, CryoSPARC has to transfer the data from the storage server to the CryoSPARC scratch folder on the job server, and this is where we see the slow transfer speeds. The transfer averages about 200 MB/s, but if we manually copy the same data (using rsync or even cp), we get between 22 GB/s and 34 GB/s.
The master's /etc/hosts may not be significant, but both the job server and the storage server have two network interfaces: one on the storage network and one on the public network. Our master server is connected only to the public network, not the storage network.
We want CryoSPARC to transfer the data from our storage servers to our job servers over the 400G storage network, not the 10G public network (which is the only network the master server can see).
If it isn't an /etc/hosts issue, does CryoSPARC just have a hard limit on the allowed transfer speed?
If by that you mean the copying of particle .mrc stacks to the cache: CryoSPARC uses the shutil.copyfile utility, which runs on the worker node; the master node is not involved in the actual copy process.
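For reference, here is a minimal Python sketch of what a single cached copy amounts to on the worker node (the paths are hypothetical placeholders, not CryoSPARC's actual layout):

import shutil

# Hypothetical paths: a particle stack on the storage mount and the
# local scratch/cache device on the worker node.
src = "/storage/project/J15/extract/particle_stack_000.mrc"
dst = "/scratch/cryosparc_cache/particle_stack_000.mrc"

# shutil.copyfile reads src and writes dst entirely on the worker node;
# the master only coordinates job metadata, it never moves the data.
shutil.copyfile(src, dst)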
Was this test run on the same server, with the same source and destination devices and the same source data?
(You mentioned you already did this.) Ensure your test resembles what the particle caching system does: copying hundreds to thousands of small (50-200 MB) files. Use htop or iotop to monitor I/O performance.
cd /path/to/cryosparc/cache/device
ls /path/to/folder/with/particle/files/*.mrc | xargs -L1 -P1 -I{} cp {} .
In an upcoming release, we're adding multi-threaded particle caching. To test whether your system will benefit from this, can you run the same xargs/cp command with multiple threads by changing the value of the -P argument? Please test with various thread counts and see what works best; a rough Python sketch of multi-threaded copying follows the example below. FWIW, on our system (HDDs + ZFS + NFS + local NVMe), CryoSPARC achieved the same speeds as the xargs/cp command below run with 2 threads.
e.g., with 2 threads (remember to monitor I/O performance with iotop or htop):
sudo bash -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
cd /path/to/cryosparc/cache/device
ls /path/to/folder/with/particle/files/*.mrc | xargs -P2 -I{} cp {} .
Note that a good place to find a folder with particle files is the job folder for an Extract From Micrographs job.
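If it helps to picture what the multi-threaded version does, here is a rough Python sketch using a thread pool. This is only an illustration with placeholder paths and thread count, not CryoSPARC's actual caching implementation:

import glob
import os
import shutil
from concurrent.futures import ThreadPoolExecutor

# Placeholders: adjust to your storage and cache paths.
SRC_DIR = "/path/to/folder/with/particle/files"
CACHE_DIR = "/path/to/cryosparc/cache/device"
NUM_THREADS = 2  # analogous to the -P argument to xargs

def copy_one(src):
    # Copy a single particle stack into the cache directory.
    dst = os.path.join(CACHE_DIR, os.path.basename(src))
    shutil.copyfile(src, dst)

files = glob.glob(os.path.join(SRC_DIR, "*.mrc"))
with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
    # Several copies are in flight at once, one per worker thread,
    # similar to running cp under xargs -P.
    list(pool.map(copy_one, files))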
Timings drop significantly as you increase the number of xargs threads. The optimum we've found is -P12.
Our testing was done with 20,460 .mrc files.
At P1, you get:
real 38m38.037s
user 0m6.449s
sys 15m1.161s
At P8 you get:
real 19m9.012s
user 0m5.440s
sys 14m2.497s
And at P12 you get:
real 16m30.794s
user 0m4.974s
sys 16m39.701s
Overall, we were seeing between 200 MB/s and 900 MB/s at P1, jumping to 1.5-3.4 GB/s at P12.
iostat was used to get xargs performance numbers.
However, in the CryoSPARC web UI the rate still hovers around 150-200 MB/s; caching the same files through CryoSPARC takes 52 minutes.
30GB/s is the theoretical max of our bandwidth. Of course, we see less than that due to our SAS3 drives. We are moving to all NVMe in the next two months, so we should get improved results then.
Our current infrastructure is:
Storage server with SAS3 drives
200G Mellanox ConnectX-7
400G switch
Local PCIe 4.0 NVMe for scratch
Thanks for testing that out and reporting your numbers. You should definitely see an increase in caching performance when multi-threaded caching is released in CryoSPARC v4.3.0. You will be able to set the number of threads to use via an environment variable (export CRYOSPARC_CACHE_NUM_THREADS=12 in cryosparc_worker/config.sh). The default number of threads is 2.
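As a rough sketch of how such a setting is typically consumed (the variable name is from this post, but the reading code below is illustrative rather than CryoSPARC's actual implementation):

import os

# Fall back to the documented default of 2 threads if the variable is unset.
num_threads = int(os.environ.get("CRYOSPARC_CACHE_NUM_THREADS", "2"))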
I'd also like to add that in v4.3, along with multi-threaded caching, we've made improvements to the wrapper around the function that copies files, which should increase the base rate (i.e., the rate when the thread count is 1).
I can report back that the changes have had a huge effect. Both the multi-threading and the "under the hood" changes have made cache loading much faster. Thanks!
Curious: what changes were made to the wrapper?
In the original version of this wrapper, every time we tried to copy a single file we were doing the following:
getting the file size of the file being copied using os.path.getsize() - maybe 1 OS call, depending on the implementation?
getting the dirname of the cache path (os.path.dirname) - 1 OS call?
attempting to create the directory if it doesn't already exist (os.makedirs(path, exist_ok=True)) - at least 1 OS call every time, 2 if the path doesn't exist (rare)?
copying the actual file using shutil.copyfile()
In the v4.3.0 implementation, we call only shutil.copyfile() for each file, eliminating the three extra OS calls. This might speed up the base copy rate, depending on the filesystem (and how fast the filesystem can access metadata).
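As a rough sketch of the difference (hypothetical function names; the real code is more involved):

import os
import shutil

def copy_to_cache_old(src, dst):
    # Pre-v4.3 pattern described above: extra per-file work before the copy.
    size = os.path.getsize(src)            # 1 extra OS call
    cache_dir = os.path.dirname(dst)       # path string manipulation
    os.makedirs(cache_dir, exist_ok=True)  # at least 1 more OS call per file
    shutil.copyfile(src, dst)
    return size

def copy_to_cache_new(src, dst):
    # v4.3.0 pattern: the per-file work is a single copyfile call,
    # assuming the cache directory already exists.
    shutil.copyfile(src, dst)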