Somehow, reading the raw images in Patch Motion Correction has become really slow for me since we switched from an NFS share to a BeeGFS filesystem.
And by really slow I mean something like 800-1000 s per image when running on 8 GPUs on one node.
CTF correction works like a charm, taking a few seconds per image.
Does anyone have an idea what to do?
Our raw images are EER files.
Compute nodes have 8x NVIDIA A40 GPUs and 2x AMD EPYC 74F3 CPUs and are connected to the BeeGFS storage via HDR InfiniBand RDMA.
Ubuntu 20.04 LTS, kernel 5.4.0-91-generic
What I have tried so far:
CUDA 11.4 and 11.5 Update 1
cryoSPARC versions 3.2.0 and 3.3.1
the following speed tests:
root@bert101:/sbdata# dd if=/dev/urandom of=/sbdata/tmpfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 37.1599 s, 289 MB/s
root@bert101:/sbdata# sh -c "sync && echo 3 > /proc/sys/vm/drop_caches"
root@bert101:/sbdata# dd if=/sbdata/tmpfile of=/dev/null bs=1M
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 4.28745 s, 2.5 GB/s
root@bert101:/sbdata# dd if=/sbdata/FoilHole_18361953_Data_18358314_18358316_20210818_161635_EER.eer of=/dev/null bs=1M
315+1 records in
315+1 records out
331058766 bytes (331 MB, 316 MiB) copied, 0.849698 s, 390 MB/s
root@bert101:/sbdata/test# dd if=/sbdata/FoilHole_17732806_Data_17696093_17696095_20211218_075614_EER.eer of=/dev/null bs=1M
842+1 records in
842+1 records out
883190976 bytes (883 MB, 842 MiB) copied, 1.38834 s, 636 MB/s
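(All of the tests above use bs=1M, i.e. large sequential reads. A complementary check for per-request latency rather than streaming bandwidth would be the same read with a small block size; a sketch, reusing the file from above:)

sync && echo 3 > /proc/sys/vm/drop_caches
# many small requests instead of a few large ones; if the MB/s figure collapses
# compared to bs=1M, the filesystem is latency-bound for small reads
dd if=/sbdata/FoilHole_18361953_Data_18358314_18358316_20210818_161635_EER.eer of=/dev/null bs=4k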
[CPU: 10.91 GB] -- 2.0: processing 4217 of 4593: J1363/imported/FoilHole_19891399_Data_19870532_19870534_20211220_033029_EER.eer
loading /sbdata/projects/X/X/cryo/cryosparc/P9/J1363/imported/FoilHole_19891399_Data_19870532_19870534_20211220_033029_EER.eer
Loading raw movie data from J1363/imported/FoilHole_19891399_Data_19870532_19870534_20211220_033029_EER.eer ...
Done in 1189.84s
Loading gain data from J1363/imported/20210527_105657_EER_GainReference.gain ...
Done in 0.00s
Processing ...
Done in 1.85s
Completed rigid and patch motion with (Z:5,Y:6,X:6) knots
Writing non-dose-weighted result to J1377/motioncorrected/007674966914075445712_FoilHole_19891399_Data_19870532_19870534_20211220_033029_EER_patch_aligned.mrc ...
Done in 0.08s
Writing 120x120 micrograph thumbnail to J1377/thumbnails/007674966914075445712_FoilHole_19891399_Data_19870532_19870534_20211220_033029_EER_thumb_@1x.png ...
Done in 0.03s
Writing 240x240 micrograph thumbnail to J1377/thumbnails/007674966914075445712_FoilHole_19891399_Data_19870532_19870534_20211220_033029_EER_thumb_@2x.png ...
Done in 0.00s
Writing dose-weighted result to J1377/motioncorrected/007674966914075445712_FoilHole_19891399_Data_19870532_19870534_20211220_033029_EER_patch_aligned_doseweighted.mrc ...
Done in 0.17s
Writing background estimate to J1377/motioncorrected/007674966914075445712_FoilHole_19891399_Data_19870532_19870534_20211220_033029_EER_background.mrc ...
Done in 0.01s
Writing motion estimates...
Done in 0.01s
Also:
We also noticed that preloading a particle stack onto the compute nodes' cache SSDs (2x NVMe PCIe 4.0, RAID 0) repeatedly spikes in speed, jumping between 20 MB/s and 500+ MB/s.
The particle-loading issue has since been resolved, but the motion correction one has not.
I think it is probably something specific to BeeGFS and how motion correction reads the EER files; when we used a simple NFS share, everything was fine with EER files.
What I have tried since:
Running the same EER files through RELION's integrated motion correction → works fine and really fast
Converting the EER files to TIFF with relion_convert_to_tiff, then Patch Motion Correction in cryoSPARC → works fine, roughly 4 s/movie (conversion command sketched below)
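For reference, the conversion step was along these lines (illustrative only; the file names are placeholders and the exact options should be checked against relion_convert_to_tiff --help for your RELION version):

# convert the EER movies listed in movies.star to compressed TIFF, grouping EER frames
relion_convert_to_tiff --i movies.star --o TiffMovies/ --eer_grouping 32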
Since it is the "Loading raw movie data" step that takes so long, I guess it is something in the EER decompressor? The GitHub repo folder linked from the cryoSPARC code is of course private, so I could not look further into what is really happening there or what kind of I/O could be slow on BeeGFS. Could you maybe tell me which I/O patterns I should test speeds for?
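In the meantime, one way I could observe the pattern from the outside would be to trace the worker process while it loads a movie (illustrative; <worker_pid> is a placeholder for the running patch-motion process):

# log every open/read/seek the worker issues, with timestamps, to see request sizes and offsets
strace -f -tt -e trace=openat,read,pread64,lseek -p <worker_pid> 2> eer_io_trace.log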
I would really love to get this solved, since it also makes cryoSPARC Live more or less useless for us at the moment, unless I converted everything to TIFF beforehand, which I would rather not do.
For a while now, cryoSPARC Live has used the following strategy for TIFF files: read the entire file off the filesystem into a shared memory region (which, on Linux, shows up as a file in /dev/shm), then open that file via libtiff from there. libtiff has a fairly bad access pattern for some network filesystems, so we found this to offer a substantial improvement in many cases: the access pattern matters much less when reading straight from RAM.
More recently, as of patch 220315, we use the same approach for EER files. It’s puzzling to me that reading TIFF files would be fast but EER would not be, as the actual code that reads the file into shared memory is the same in both cases.
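In shell terms the strategy is roughly equivalent to the following (a minimal sketch, not our actual code; movie.eer and reader_binary are stand-ins):

# one large sequential read stages the whole movie in RAM-backed tmpfs ...
cp /sbdata/movie.eer /dev/shm/movie.eer
# ... so the reader's many small seeks and reads then hit RAM instead of the network filesystem
reader_binary /dev/shm/movie.eer
rm /dev/shm/movie.eer   # clean up the shared-memory copy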
The shared memory feature can be disabled via an environment variable (add export CRYOSPARC_TIFF_IO_SHM=false to the worker's config.sh), but it currently cannot be disabled separately for EER vs. TIFF files. Perhaps check your worker config and make sure that the feature I described isn't accidentally disabled? Also perhaps check that the patch was applied successfully and that nothing went wrong?
There was no such line in the worker's config.sh, but I added it set to true just to be sure it is enabled:
export CRYOSPARC_TIFF_IO_SHM=true
I updated from 3.3.1 to 3.3.2; does that not already include the patches? In any case, I now get an error when trying to check for or download patches:
Master running v3.3.2, worker running v3.3.2
cryosparcuser@kermit103:~/cryosparc_master$ cryosparcm patch --check
Could not get latest patch (status code 404)
No patches available for current release.
To ignore already-installed patches, add the --force flag
cryosparcuser@kermit103:~/cryosparc_master$ cryosparcm patch --download
Could not get latest patch (status code 404)
No patches available for current release.
Apart from that, I checked and realized we actually had not tried it in Live, because I thought it was just the same as normal Patch Motion Correction. I have now run the same dataset once in a normal workspace and once in a Live session, and it is true that Live works just fine with EER. So the problem is only with normal Patch Motion Correction.
I ran Live first and stopped it after around 50 images. The funny thing is that the subsequent Patch Motion Correction in a normal workspace was also fast (though not as fast as with TIFF) for the images already processed in Live, and after those it slowed back down to the usual really, really slow speed.
I checked, and the Live workers were running on a different node than the Patch Motion Correction, so I guess it is also something BeeGFS-related: the images already processed by Live were presumably still in the cache of the storage nodes?
Is normal Patch Motion Correction also supposed to read the file into shared memory first?
Also, I still do not really understand why reading via shm is so much faster than the BeeGFS mount via RDMA, but that is fine by me.
Apart from that, I checked and realized we actually had not tried it in Live, because I thought it was just the same as normal Patch Motion Correction. I have now run the same dataset once in a normal workspace and once in a Live session, and it is true that Live works just fine with EER. So the problem is only with normal Patch Motion Correction.
This is correct. Currently, Patch Motion and Live do not use the same I/O code, though fixing that is on our roadmap. What you are describing is what I would expect to see. Apologies for the confusion.
I checked, and the Live workers were running on a different node than the Patch Motion Correction, so I guess it is also something BeeGFS-related: the images already processed by Live were presumably still in the cache of the storage nodes?
Yes, I think that's exactly right.
Is normal Patch Motion Correction also supposed to read the file into shared memory first?
As I mentioned above, not currently. However, we do intend to fix this.
Also, I still do not really understand why reading via shm is so much faster than the BeeGFS mount via RDMA, but that is fine by me.
I’m not sure on this one. Perhaps latency, as opposed to bandwidth, is the main mechanism involved in the slowdown?
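If you want to test that hypothesis, a synchronous small-block random-read benchmark isolates latency from bandwidth; something like this fio run against the tmpfile from your earlier tests (assuming fio is installed on the node):

# 4 KiB random reads with O_DIRECT in a single synchronous job:
# throughput here is dominated by per-request latency, not link bandwidth
fio --name=latency_probe --filename=/sbdata/tmpfile --rw=randread --bs=4k --direct=1 --size=1G --runtime=30 --time_based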
Thanks a lot for your time and the detailed answers!
Looking forward to the fixed normal Patch Motion Correction then!
If we can ever test anything for you related to that or to BeeGFS, I would be happy to do so.
Could we please have the same change you made to the Patch Motion Correction I/O code for Local Motion Correction as well? That apparently also loads very slowly with EER on our filesystem. I had not really used it with EER datasets before, so I only noticed this now.