Hi there, I have recently been running reference-based motion correction on our campus computing cluster. Surprisingly, the performance was really poor: it took about 15 days to extract 400k particles from 10k movies with 4 GPUs, even though the computing hardware is excellent (A40 cards and 1 TB of memory). In comparison, a similar reference-based motion correction job finishes on our own workstations in about a day.
The setup on the computing cluster is not fundamentally wrong: certain jobs, such as ab initio reconstruction and non-uniform refinement, are lightning fast. This is deeply puzzling. I cannot help wondering whether the code for reference-based motion correction is simply not yet optimized for communication under SLURM, since it is a relatively new job type?
How are you accessing the filesystem on SLURM vs. on your workstations? RBMC has to go back to the movies, so the data volume is orders of magnitude larger than for most other jobs. Do you see the same discrepancy with motion correction?
Both SLURM and our local workstation use a networked filesystem. I don’t observe a slowdown with motion correction on our computing cluster; in fact, it only takes 11 seconds to process one EER movie.
Interrogating the job.log file may provide some indication of what the issue is. My suspicion would be overhead in the read-movie phase.
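For what it's worth, a quick way to spot a read bottleneck is to look for large wall-clock gaps between consecutive timestamped lines in the log. A minimal sketch, assuming the log lines start with an ISO-like timestamp (adjust the regex and format string to whatever your job.log actually contains):

```python
import re
import sys
from datetime import datetime

# Assumed timestamp prefix, e.g. "2024-05-01 13:02:45 ..." -- adjust to your log format.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

def largest_gaps(path, top=20):
    """Return the largest time gaps between consecutive timestamped log lines."""
    gaps = []
    prev_ts, prev_line = None, None
    with open(path, errors="replace") as fh:
        for line in fh:
            m = TS_RE.match(line)
            if not m:
                continue
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            if prev_ts is not None:
                # Record how long elapsed after the previous timestamped line.
                gaps.append(((ts - prev_ts).total_seconds(), prev_line.rstrip()))
            prev_ts, prev_line = ts, line
    return sorted(gaps, reverse=True)[:top]

if __name__ == "__main__":
    for seconds, line in largest_gaps(sys.argv[1]):
        print(f"{seconds:8.1f}s elapsed after: {line}")
```

If the biggest gaps cluster around the movie-reading steps, that would support the IO-overhead suspicion.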
Note the potential for slowdowns when accessing TIFFs over networked filesystems in general. Motion correction implements a workaround that copies the file into shared memory prior to reading; this does not happen in RBMC. See the response linked below.
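For context, the workaround is conceptually just "copy the whole file into a RAM-backed location, then read it from there", which turns many small strided network reads into one large sequential transfer. A rough sketch of that idea (not the actual CryoSPARC code; the /dev/shm staging directory and the frame reader are assumptions):

```python
import os
import shutil
from contextlib import contextmanager

@contextmanager
def staged_in_shm(src_path, shm_dir="/dev/shm"):
    """Copy a movie from the networked filesystem into RAM-backed storage,
    yield the local path, and clean up afterwards. One big sequential copy
    over NFS replaces many small reads of individual TIFF/EER frames."""
    local_path = os.path.join(shm_dir, os.path.basename(src_path))
    shutil.copyfile(src_path, local_path)
    try:
        yield local_path
    finally:
        os.remove(local_path)

# Hypothetical usage: read frames from the RAM-backed copy instead of NFS.
# with staged_in_shm("/data/movies/movie_0001.tif") as local_tif:
#     frames = read_frames(local_tif)  # read_frames is a placeholder, e.g. tifffile.imread
```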
The IO can be a bottleneck in RBMC. Thank you for pointing out the current difference in how TIFFs are accessed between the motion correction job and the RBMC job!
That said, RBMC jobs on our local CryoSPARC instance, which also reads data from a network file server, ran much faster than on the campus computing cluster (mentioned in the original post). That is why I suspect the SLURM system may be contributing to the sub-optimal performance.
The SLURM system is wonderful, and I really like how well it handles job priority, waiting time, and user fairness. However, in the case of RBMC, where individual particles are presumably passed around between different processes, the communication and coordination of particles, depending on how it is implemented, could be a serious overhead.
I am also talking to our campus computing cluster staff to pinpoint the cause.
You make a good point. However, rather than SLURM, could a contributing factor be differences in data access patterns from your local (I assume) workstation compared to the cluster nodes?
Could you clarify this? Do you mean how busy the networked file servers are? If so, the networked file server of our campus computing cluster is almost certainly much busier. I did observe some fluctuation in the daily throughput of RBMC, between 800 and 1000 movies per day, which may be related to how busy the cluster was.
The raw data were not being read from the same networked filesystem, but both networked filesystems have reasonable IO speeds. Measured with a dd command, the campus networked filesystem gives about 135/833 MB/s write/read, while our local networked filesystem gives about 223/1500 MB/s write/read.
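For reference, a dd-style check mostly measures sequential streaming throughput. Here is a rough Python approximation of that test (the test path and sizes are placeholders; run it on the mount you want to check):

```python
import os
import time

def sequential_throughput(path, total_mb=1024, block_mb=1):
    """Write and then re-read a test file sequentially, returning (write, read) MB/s.
    This approximates a simple dd test; it says nothing about random-read behaviour."""
    block = os.urandom(block_mb * 1024 * 1024)
    nblocks = total_mb // block_mb

    t0 = time.time()
    with open(path, "wb") as fh:
        for _ in range(nblocks):
            fh.write(block)
        fh.flush()
        os.fsync(fh.fileno())
    write_mbps = total_mb / (time.time() - t0)

    t0 = time.time()
    with open(path, "rb") as fh:
        while fh.read(block_mb * 1024 * 1024):
            pass
    read_mbps = total_mb / (time.time() - t0)

    os.remove(path)
    return write_mbps, read_mbps

# e.g. sequential_throughput("/path/to/nfs/mount/iotest.bin")
```

One caveat: the read pass can be served from the client's page cache unless the test file is larger than RAM (or caches are dropped between the write and read), so treat the read number as an upper bound.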
Since the CryoSPARC instance on our campus cluster is managed by the center staff, I am asking them to run the log diagnostics and will post the findings when they are available.
In our case, the only solutions that proved effective were either a) reducing the number of frames per stack by switching from raw EER camera frames to discretely fractionated TIFFs, or b) switching the NFS source to an SSD array, suggesting that there is unusual pressure being exerted by random read events.
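If it helps anyone debugging a similar setup, the random-read-pressure hypothesis can be tested directly by comparing sequential reads against randomly ordered reads of the same large file on the suspect mount. A minimal sketch (the file path and block size are placeholders; use an existing large movie on the NFS mount):

```python
import os
import random
import time

def read_throughput(path, block_size=4 * 1024 * 1024, randomize=False):
    """Read an existing large file block by block, either in order or in a
    shuffled order, and return throughput in MB/s."""
    size = os.path.getsize(path)
    offsets = list(range(0, size, block_size))
    if randomize:
        random.shuffle(offsets)

    t0 = time.time()
    with open(path, "rb") as fh:
        for off in offsets:
            fh.seek(off)
            fh.read(block_size)
    return (size / (1024 * 1024)) / (time.time() - t0)

# e.g. on a large movie residing on the NFS mount in question:
# seq = read_throughput("/nfs/movies/big_movie.eer")
# rnd = read_throughput("/nfs/movies/big_movie.eer", randomize=True)
```

A large gap between the sequential and randomized numbers would point to the storage backend (spinning disks behind NFS) struggling with random reads; run each test on a file larger than the client's RAM, or drop caches between runs, so you are not just measuring the page cache.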