Slow loading of raw movie data in Patch motion correction

Hello all,

Somehow, reading the raw movie data in Patch Motion Correction has become really slow for us since we switched from an NFS share to a BeeGFS filesystem.

And by really slow I mean 800-1000 s per image when running on 8 GPUs on one node.
CTF correction works like a charm, taking a few seconds per image.

Does anyone have an idea what to do?

Our raw images are EER files.
Compute nodes have 8x NVIDIA A40 GPUs and 2x AMD EPYC 74F3 CPUs, and are connected to the BeeGFS via HDR InfiniBand RDMA.
Ubuntu 20.04 LTS 5.4.0-91-generic

So far I have tried:

  • CUDA 11.4 and 11.5 Update 1
  • cryoSPARC versions 3.2.0 and 3.3.1
  • the following speed tests:
root@bert101:/sbdata# dd if=/dev/urandom of=/sbdata/tmpfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 37.1599 s, 289 MB/s

root@bert101:/sbdata# sh -c "sync && echo 3 > /proc/sys/vm/drop_caches"

root@bert101:/sbdata# dd if=/sbdata/tmpfile of=/dev/null bs=1M
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 4.28745 s, 2.5 GB/s

root@bert101:/sbdata# dd if=/sbdata/FoilHole_18361953_Data_18358314_18358316_20210818_161635_EER.eer of=/dev/null bs=1M
315+1 records in
315+1 records out
331058766 bytes (331 MB, 316 MiB) copied, 0.849698 s, 390 MB/s

root@bert101:/sbdata/test# dd if=/sbdata/FoilHole_17732806_Data_17696093_17696095_20211218_075614_EER.eer of=/dev/null bs=1M
842+1 records in
842+1 records out
883190976 bytes (883 MB, 842 MiB) copied, 1.38834 s, 636 MB/s
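One caveat about the dd numbers above: with bs=1M they only measure large sequential reads. If a decoder issues many much smaller reads, throughput on a network filesystem can look very different. A minimal sketch to compare chunk sizes on the same mount (the demo uses a throwaway temp file; the chunk sizes are arbitrary examples, not what any real decoder uses):

```python
import os
import tempfile
import time

def timed_read(path, chunk_size):
    """Read the whole file sequentially in chunk_size pieces; return MB/s."""
    total = 0
    start = time.perf_counter()
    # buffering=0 so every read() is a real request against the filesystem
    with open(path, "rb", buffering=0) as f:
        while chunk := f.read(chunk_size):
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / elapsed / 1e6

# Throwaway test file; replace with a real .eer path on the BeeGFS mount.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(8 * 1024 * 1024))
    path = tmp.name

for chunk in (4 * 1024, 64 * 1024, 1024 * 1024):
    print(f"{chunk // 1024:5d} KiB chunks: {timed_read(path, chunk):9.1f} MB/s")

os.unlink(path)
```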

cheers
Kilian

Edit:
Here is the information the job printed.

[CPU: 10.91 GB]  -- 2.0: processing 4217 of 4593: J1363/imported/FoilHole_19891399_Data_19870532_19870534_20211220_033029_EER.eer
        loading /sbdata/projects/X/X/cryo/cryosparc/P9/J1363/imported/FoilHole_19891399_Data_19870532_19870534_20211220_033029_EER.eer
        Loading raw movie data from J1363/imported/FoilHole_19891399_Data_19870532_19870534_20211220_033029_EER.eer ...
        Done in 1189.84s
        Loading gain data from J1363/imported/20210527_105657_EER_GainReference.gain ...
        Done in 0.00s
        Processing ...
        Done in 1.85s
        Completed rigid and patch motion with (Z:5,Y:6,X:6) knots
        Writing non-dose-weighted result to J1377/motioncorrected/007674966914075445712_FoilHole_19891399_Data_19870532_19870534_20211220_033029_EER_patch_aligned.mrc ...
        Done in 0.08s
        Writing 120x120 micrograph thumbnail to J1377/thumbnails/007674966914075445712_FoilHole_19891399_Data_19870532_19870534_20211220_033029_EER_thumb_@1x.png ...
        Done in 0.03s
        Writing 240x240 micrograph thumbnail to J1377/thumbnails/007674966914075445712_FoilHole_19891399_Data_19870532_19870534_20211220_033029_EER_thumb_@2x.png ...
        Done in 0.00s
        Writing dose-weighted result to J1377/motioncorrected/007674966914075445712_FoilHole_19891399_Data_19870532_19870534_20211220_033029_EER_patch_aligned_doseweighted.mrc ...
        Done in 0.17s
        Writing background estimate to J1377/motioncorrected/007674966914075445712_FoilHole_19891399_Data_19870532_19870534_20211220_033029_EER_background.mrc ...
        Done in 0.01s
        Writing motion estimates...
        Done in 0.01s

Also:
We noticed that preloading a particle stack onto the compute nodes' cache SSDs (2x NVMe PCIe 4.0, RAID 0) fluctuates in speed, repeatedly spiking from 20 MB/s to 500+ MB/s.

The particle-loading issue has been resolved, but motion correction still has not. I think it is probably something specific to BeeGFS and how motion correction reads the EER files; when we used a simple NFS share, everything was fine with EER files.

I have now tried:

  • Running the same EER files through RELION's integrated motion correction → works fine and really fast
  • Converting the EER files to TIFF with relion_convert_to_tiff, then running Patch Motion in cryoSPARC → works fine, roughly 4 s/movie

Since the "Loading raw movie data" step is taking so long, I guess it is something in the EER decompressor? The GitHub repo folder linked in the cryoSPARC code is of course private, so I couldn't look further into what is really happening there or what kind of I/O could be slow on the BeeGFS. Could you maybe tell me which I/O patterns I should test speeds for?
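One candidate pattern worth timing, under the assumption that the decoder performs many small seeks and reads rather than one big streaming read, is random small-block access. A hedged sketch (the demo file, read count, and 4 KiB request size are placeholders):

```python
import os
import random
import tempfile
import time

def probe_random_reads(path, n_reads=1000, read_size=4096):
    """Time n_reads pread() calls of read_size bytes at random offsets;
    return the mean latency per call in milliseconds."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        offsets = [random.randrange(0, max(1, size - read_size))
                   for _ in range(n_reads)]
        start = time.perf_counter()
        for off in offsets:
            os.pread(fd, read_size, off)
        return (time.perf_counter() - start) / n_reads * 1e3
    finally:
        os.close(fd)

# Demo on a throwaway file; point this at an .eer on the BeeGFS mount instead.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(4 * 1024 * 1024))
    path = tmp.name
print(f"{probe_random_reads(path):.4f} ms per 4 KiB random read")
os.unlink(path)
```

On a network filesystem, a high per-call figure here with a healthy dd throughput would point at per-request latency rather than bandwidth as the bottleneck.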

I would really love to have this solved, since it also makes cryoSPARC Live kind of useless for us at the moment, unless I converted everything to TIFF beforehand, which I would rather not do.

@KiSchnelle Unfortunately, we don’t have experience with the combination of EER data and BeeGFS. We’ll look into it and keep you posted.

Just wanted to add that I updated to the latest version, since I read this line:

Improved EER read performance for some file systems PATCH 220315

But sadly it didn't change anything for us.

cheers
Kilian

Hi @KiSchnelle,

For a while now, cryoSPARC Live has used the following strategy for TIFF files: read the entire file off the filesystem into a shared memory region (which, on Linux, shows up as a file in /dev/shm), then open that file via libtiff from there. Libtiff has a fairly bad access pattern for some network filesystems, so we found this to offer a substantial improvement in many cases - the access pattern matters much less when reading straight from RAM.

More recently, as of patch 220315, we use the same approach for EER files. It’s puzzling to me that reading TIFF files would be fast but EER would not be, as the actual code that reads the file into shared memory is the same in both cases.
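The copy-then-open idea described above can be sketched roughly as follows. This is not cryoSPARC's actual implementation, just a minimal illustration; the function name and demo paths are invented:

```python
import os
import shutil
import tempfile

def stage_via_shm(src_path, shm_dir="/dev/shm"):
    """Copy src_path into a RAM-backed directory with one big sequential
    read, so a seek-heavy consumer (e.g. libtiff) then hits RAM instead
    of the network filesystem. Returns the path of the staged copy."""
    dst = os.path.join(shm_dir, os.path.basename(src_path))
    shutil.copyfile(src_path, dst)
    return dst

# Demo with a throwaway file standing in for an EER/TIFF on the network FS.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x01" * 1024)
    src = tmp.name

demo_dir = tempfile.mkdtemp()  # stand-in for /dev/shm so the demo runs anywhere
staged = stage_via_shm(src, shm_dir=demo_dir)
try:
    with open(staged, "rb") as f:
        assert f.read() == b"\x01" * 1024  # reads now come from the staged copy
finally:
    os.unlink(staged)
    os.unlink(src)
```

The copy itself is a single large sequential read, which network filesystems generally handle well; all subsequent seeks and small reads land on tmpfs.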

The shared memory feature can be disabled via an environment variable (add export CRYOSPARC_TIFF_IO_SHM=false in the worker config.sh) but currently it cannot be disabled separately for EER vs TIFF files. Perhaps check your worker config and make sure that the feature I described isn’t accidentally disabled? Also perhaps check that the patch was applied successfully and that nothing went wrong?

– Harris

Hi @hsnyder ,

Thank you for your reply! There are two things.

There is no such line in the worker's config.sh, but I added it set to true just to be sure it is enabled.

export CRYOSPARC_TIFF_IO_SHM=true

I updated from 3.3.1 to 3.3.2; does that not already apply the patches? Otherwise, I now get an error when trying to check for or download patches.

Master running v3.3.2, worker running v3.3.2

cryosparcuser@kermit103:~/cryosparc_master$ cryosparcm patch --check
Could not get latest patch (status code 404)
No patches available for current release.
To ignore already-installed patches, add the --force flag

cryosparcuser@kermit103:~/cryosparc_master$ cryosparcm patch --download
Could not get latest patch (status code 404)
No patches available for current release.

Apart from that, I actually checked and realized we hadn't tried it in Live, because I thought it was just the same as normal Patch Motion Correction. I have now run the same dataset once in a normal workspace and once in a Live session, and it is true that Live works just fine with EER. So the problem is only with normal Patch Motion Correction.

I ran Live first and stopped it after around 50 images. The funny thing is that the subsequent Patch Motion in a normal workspace was also fast (though not as fast as TIFF) for the images already processed in Live, and after those it slowed down to the usual really, really slow pace.

I checked, and the Live workers were running on a different node than the Patch Motion Correction, so I guess it is also something BeeGFS-related, where the processed images were still in the cache of the storage nodes?

Is normal Patch Motion Correction also supposed to read the file into shared memory first?

Also, I still don't really understand why reading into shm first is so much faster than the BeeGFS mount via RDMA, but that's OK with me.

cheers
Kilian

Hi Kilian,

It looks like you’re up to date!

Apart from that, I actually checked and realized we hadn't tried it in Live, because I thought it was just the same as normal Patch Motion Correction. I have now run the same dataset once in a normal workspace and once in a Live session, and it is true that Live works just fine with EER. So the problem is only with normal Patch Motion Correction.

This is correct. Currently, patch motion and live do not use the same I/O code, though fixing that is on our roadmap. What you are describing is what I would expect to see. Apologies for the confusion.

I checked, and the Live workers were running on a different node than the Patch Motion Correction, so I guess it is also something BeeGFS-related, where the processed images were still in the cache of the storage nodes?

Yes I think that’s exactly right.

Is normal Patch Motion Correction also supposed to read the file into shared memory first?

As I mentioned above, not currently. However, we do intend to fix this.

Also, I still don't really understand why reading into shm first is so much faster than the BeeGFS mount via RDMA, but that's OK with me.

I’m not sure on this one. Perhaps latency, as opposed to bandwidth, is the main mechanism involved in the slowdown?
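To put rough numbers on that hypothesis (the request size and round-trip time below are assumed values, not measurements): if every small read pays a fixed per-request round trip, total time scales with the number of requests rather than the bytes moved.

```python
# Back-of-envelope latency model; read_size and rtt are assumptions.
file_size = 883_190_976     # bytes, the larger EER file from the dd test above
read_size = 4_096           # bytes per request (assumed)
rtt = 0.5e-3                # seconds of per-request latency (assumed)

n_requests = file_size // read_size
latency_cost = n_requests * rtt   # seconds spent purely waiting

print(f"{n_requests} requests -> ~{latency_cost:.0f} s of pure latency")
# At a few ms per request, that alone reaches the 1000+ s range observed,
# even though the same file streams through dd in under two seconds.
```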

–Harris


@hsnyder

Thanks a lot for your time and detailed answers!!! :slight_smile:

Looking forward to the fixed normal Patch Motion Correction then!
If we can ever test anything for you related to that/BeeGFS i would be happy to do that :slight_smile:

cheers
Kilian


Just wanted to give quick feedback that this problem is fixed with the newest patch. :)
Thanks a lot!

cheers
Kilian
