We have CryoSPARC installed on a GPU cluster with five 4-way A100 Dell 8640s as the workers. The nodes are connected to an enterprise Isilon SAN over 10 Gbps Ethernet. Jobs are running 1-2x slower than we expect, and we have found no immediate bottlenecks on the cluster, the routers, or the storage environment.
Is there a “best way” to connect NFS mounts to workers other than NFSv3? What about suggested MTU settings on routers between workers and storage?
NFSv3 is usually pretty good and much faster than SMB/samba.
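Nothing in this thread prescribes specific mount options, but for illustration, a typical NFSv3 client mount with larger transfer sizes might look like the hypothetical /etc/fstab entry below (server name, export path, and mount point are placeholders; check which options your Isilon actually supports):

# Hypothetical /etc/fstab entry: NFSv3 with 1 MiB read/write sizes
isilon.example.com:/ifs/cryoem  /mnt/cryoem  nfs  vers=3,rsize=1048576,wsize=1048576,hard,noatime  0  0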
Do you have access to the 10 Gbps switch, or is it all under Dell's management?
In your SAN setup, is it all SSD, or is there a separate scratch setup? If there is SSD/scratch, how fast are particles being written to it? That would give you an idea of the speeds; for example, 200-400 MB/s would be on the slower side of what I would expect.
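If you want a rough number for that, one simple check is a direct-I/O sequential write on the scratch volume (the /scratch path below is a placeholder for your actual scratch mount):

# Write ~4 GiB with direct I/O so the page cache doesn't inflate the result
dd if=/dev/zero of=/scratch/ddtest.bin bs=1M count=4096 oflag=direct status=progress
# Remove the test file afterwards
rm -f /scratch/ddtest.bin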
You can also adjust your /cryosparc_master/config.sh and add the line

export CRYOSPARC_CACHE_NUM_THREADS=6

with a value of 6, 8, 12, 16, or 24 if you have the cores; this can help in some cases.
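A minimal sketch of that change, assuming the config path from the post above and a thread count of 16, followed by a restart so the setting takes effect:

# Append to cryosparc_master/config.sh
export CRYOSPARC_CACHE_NUM_THREADS=16

# Restart CryoSPARC so the new value is picked up
cryosparcm restart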
Thank you for the response. Unfortunately, I only have metric data and graphs from the 10G switch. As for the worker nodes, they have local NVMe scratch. When I installed CryoSPARC, I used the "--nossd" option to make it work with the NVMe drives. I have added the export line to the config with a value of 16 and will see how that works out.
Do you know if there are MTUs on routers/switches that need to be set to a specific value (jumbo frames)?
Thank you again!
@wmatthews Are you referring to the cryosparc_worker/bin/cryosparcw connect --nossd option? This option would usually disable particle caching unless caching is configured in some other way, such as with the CRYOSPARC_SSD_PATH variable.
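For reference, if caching were configured that way instead of via the connect flags, it would be a single line in the worker config (the cache path below is hypothetical):

# In cryosparc_worker/config.sh
export CRYOSPARC_SSD_PATH=/mnt/nvme/cryosparc_cache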
Yes, when I installed the workers, I used the cryosparcw connect --worker <worker_node> --master <master_node> --nossd option because the software wouldn't install otherwise and wouldn't recognize the NVMe drives as 'SSD' on the worker nodes.
Our user has run CryoSPARC in this configuration with no issues, particularly when the data was coming from a local Isilon. Now we're feeding the data from an enterprise NFS share (same network speed) and the jobs run 1-2x slower.
Is there a configuration in the CryoSPARC software I could check? We have our networks team reviewing graphs for network issues but there don’t seem to be any.
Thank you!
If you have available storage that is significantly faster than the project directory storage, you may re-run the cryosparcw connect command on each worker with the following changes:

- add the --update option
- remove the --nossd option
- add the --ssdpath /path/to/cache option, where /path/to/cache is a directory on the faster storage

If /path/to/cache is on shared storage, check which CRYOSPARC_CACHE_LOCK_STRATEGY applies to your case and should be configured in cryosparc_worker/config.sh.
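Putting those changes together, a re-connect on one worker might look like the following (hostnames and the cache path are placeholders; --nossd is simply omitted):

cryosparc_worker/bin/cryosparcw connect \
    --worker <worker_node> \
    --master <master_node> \
    --update \
    --ssdpath /path/to/cache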