The speed improvements in 4.6 sound great, but so far on two of our systems we have not been able to reproduce them consistently.
E.g., a 3D classification job that took 3h44m in 4.5.1 is now taking 3h56m on 4.6 (even with fewer total iterations).
More concerning though is that our 3D classification jobs do not appear to be giving the same results.
For example, I have a two-class classification which converged to a roughly 55/45 split in 4.5.1 after 639 O-EM and 16 full iterations, with clear visual differences between the classes. In a cloned job run in 4.6, it abruptly collapses into a single class at around checkpoint 33 and ends with 99.9% of particles in one class after 639 O-EM and 2 full iterations.
I thought perhaps this was just stochastic variation, but when I re-run the job in 4.6 with the same random seed used in 4.5.1, I get the same result: total class collapse.
I should note that we are seeing some speed improvements for NU-refine, but more on the order of a 1.1-1.2x speedup, not 2x (on an RTX 4090/Threadripper system).
There were no substantial changes to 3D class between v4.5.1 and v4.6, but we’re continuing to investigate to make sure we haven’t missed anything. Generally, 3D class (and any clustering method) is quite sensitive to initialization, so it’s not unexpected that between runs there would be enough difference that in some runs a cluster would be missed (i.e., collapse). We’ve noticed that this is especially true when Force hard classification is on – is it enabled in this case?
We are also looking into the fact that results were different with the same random seed.
Finally, just to double check, could you post/DM me the event log (with images) for the collapsing job?
Force hard classification was off. I had run multiple classifications with the same settings (and minor tweaks to target resolution) in 4.5.1 and never saw this, which makes me a little suspicious, especially since both 4.6 runs show it (the cloned job and the clone with the same seed). Will send logs via DM, thanks again for taking a look!
On our other 4.6 system (2080Ti/Xeon), 3D classification is working as expected, at least from a couple of test runs, but we are not seeing significant speed increases (e.g. 1h37m vs 1h46m for a cloned run with an identical random seed).
I’ll let Valentin address the questions around 3D classification, but I can address the concerns about speed. The speedups in 4.6 are entirely due to the speed of reading particles from the SSD cache. The level of speedup depends significantly on the type and performance of the SSD(s) that are present for caching particles. In our tests we also see a ~1.2-1.5x speedup on a system with a slower SSD (see System B here). Could you post the model and capacity of your threadripper system’s cache SSD, as well as how full it currently is?
That said, we’ve never observed 4.6 to be meaningfully slower than earlier versions. Regarding the two 3D classification jobs you mention in your second paragraph, are you sure the comparison is fair with regard to particle caching time being accounted for, and that the system was otherwise idle during both runs?
Yes, the particles were already in the cache in both cases, and activity on the systems was comparable as far as I could tell. However, I suspect you are right that there may have been something else going on, as the job with the same random seed completed in 2h10m, vs 3h44m in 4.5.1 (albeit with the class collapse mentioned above).
Alright. Keep us posted if there are any speed regressions. Regarding SSDs in general:
Consumer SSDs tend to get slower when full. You may be able to get some more speed by clearing off some space and issuing a TRIM to the drive. Some people use a partition smaller than the drive’s capacity to help counteract this. This is less of a problem on enterprise drives (e.g. U.2 form factor).
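For example, a minimal sketch of checking and manually trimming the cache filesystem (assuming it is mounted at /scratch and the filesystem supports discard; adjust the path for your setup):

```bash
# See how full the cache filesystem is
df -h /scratch

# Manually issue a TRIM to the mounted filesystem (needs root)
sudo fstrim -v /scratch

# Most systemd-based distros also ship a periodic TRIM timer
sudo systemctl enable --now fstrim.timer
```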
Previously, the differences in SSD performance didn’t show up very significantly in overall job runtimes. Now, there’s a large difference. Our best results were on a RAID-0 array of PCIe Gen 4 SSDs. If you’re using a single Gen 3 SSD (for example), upgrading to a RAID-0 of Gen 4 drives could be a significant benefit now (comparable to a major GPU upgrade, etc.).
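For illustration, a rough sketch of how a two-drive RAID-0 scratch array could be assembled with mdadm (the device names and mount point are placeholders, and this wipes both drives):

```bash
# Stripe two NVMe drives into a single RAID-0 block device (DESTROYS existing data)
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1

# Create a filesystem and mount it as the CryoSPARC cache location
sudo mkfs.xfs /dev/md0
sudo mkdir -p /scratch/cryosparc_cache
sudo mount /dev/md0 /scratch/cryosparc_cache
```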
There’s some interaction with CPU performance as well. Threadrippers (3000+) are very fast, but if your older system is running Broadwell Xeons (as an example), I wouldn’t recommend investing in an SSD upgrade.
Thanks Harris, will keep this in mind! Re the SSD, it is a 3.84TB M.2 NVMe SSD, approx. half full; not sure of the exact model, but it was from Jan 2023 if that helps.
Out of interest, why do you suggest using a RAID0 array of smaller drives for scratch, rather than a single larger volume?
Sounds good. Knowing whether the drive is Gen3 or Gen4 would be interesting but it’s up to you whether or not you want to check (lspci -v may be able to get you the model number). If it was a recent model in 2023, it’s probably Gen 4.
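Something along these lines should identify the drive and its negotiated PCIe link speed (the PCI address 02:00.0 is just an example; nvme-cli may need to be installed):

```bash
# List NVMe controllers and their model strings
lspci -v | grep -i -A1 'non-volatile memory controller'

# If nvme-cli is available, this shows model, capacity and firmware per drive
sudo nvme list

# The negotiated link speed indicates the generation: 8GT/s = Gen 3, 16GT/s = Gen 4
sudo lspci -vv -s 02:00.0 | grep -i 'lnksta:'
```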
I just thought of one other thing… what distro version (kernel version, really) are you using? The speedups rely on io_uring, which was only made available in kernel 5.1 IIRC. Old distros will not see nearly the same benefit.
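A quick way to check (note the io_uring_disabled sysctl only exists on fairly recent kernels, so its absence on an older machine is expected):

```bash
# io_uring was introduced in kernel 5.1; anything older won't benefit
uname -r

# On recent kernels (6.6+) you can also confirm io_uring hasn't been turned off
# system-wide: 0 = enabled, 1 = restricted, 2 = fully disabled
cat /proc/sys/kernel/io_uring_disabled 2>/dev/null
```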
Re: RAID-0… by having two drives, you have double the total bandwidth. RAID can be unfavourable from a latency perspective, but assuming the particle box size isn’t super small, particles are large enough that reading them seems generally bandwidth limited.
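If you want to sanity-check the bandwidth of a cache drive, a quick sequential-read benchmark with fio looks something like this (assuming fio is installed and /scratch is the cache mount; the test file can be deleted afterwards):

```bash
# Rough sequential-read bandwidth test against the cache filesystem
fio --name=seqread --filename=/scratch/fio_testfile --size=10G \
    --rw=read --bs=1M --direct=1 --numjobs=1 --ioengine=libaio
rm /scratch/fio_testfile
```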
OS is CentOS 7, kernel is 3.10.0-1160.92.1.el7.x86_64… so I guess it is a Gen 3 drive and the kernel is too old?
The other system (Ubuntu 20.04) has a newer kernel (5.4.0-187-generic) and different SSD: 02:00.0 Non-Volatile memory controller: Intel Corporation SSD Pro 7600p/760p/E 6100p Series (rev 03) (prog-if 02 [NVM Express])
I thought that model didn’t come in a 3.84TB (“4TB”) variant. Are you sure that’s your scratch drive and not your boot drive? In any event yes that’s a Gen 3 drive, and a modest one at that. The 760p is also a Gen 3 drive.
Linux 3.10 is indeed too old for the most significant speedups, but there should be a slight performance improvement due to the caching of open file descriptors. Unfortunately, CentOS 7 has the lowest default open-fd limit I’ve ever seen, so even that source of performance benefit may be diminished on very large datasets. (Edit: the above isn’t correct; caching of open file descriptors is disabled along with io_uring on kernels that don’t support it.)
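Independent of the edit above, if you do want to check or raise the default open-file limit, it looks something like this (the account name is a placeholder for whichever user runs the CryoSPARC workers):

```bash
# Current soft and hard open-file limits for this shell/user
ulimit -Sn
ulimit -Hn

# To raise them persistently, add lines like these to /etc/security/limits.conf
# (placeholder user name; takes effect on next login):
#   cryosparcuser  soft  nofile  65535
#   cryosparcuser  hard  nofile  65535
```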
Linux 5.4 supports io_uring, which we rely on for the really high speedups, but the Gen 3 drive will probably hold the system back a bit.
Yes I think you’re right, that was the boot drive! I think (from lsblk -d -o NAME,MODEL) the scratch drive is “Micron_7300_MTFDHBG3T8TDF” for the CentOS system (which does seem to be gen4 based on a cursory google), and “INTEL SSDPEKKF020T8” for the Ubuntu system (which I think is gen3?).
The Micron drive is Gen 3, but seems to be a pretty good one. The INTEL SSDPEKKF020T8 seems to be a 2TB Gen 3 drive, and decidedly less performant than the Micron.
Thanks Harris - this has been a very useful discussion.
Is this info (optimal config for SSD caching in CS) summarized somewhere? It might be useful to have a page in the guide describing considerations when configuring a system for CS, if one does not already exist. I hadn’t considered the need for two SSDs in RAID-0, for example, or the difference between Gen 3 & Gen 4 drives, kernel versions, etc.
This is not documented anywhere yet, but I agree that it should be. In the past, the IO code in cryosparc wasn’t able to fully leverage very fast SSDs anyway, so it didn’t make much difference. As of 4.6, we can take advantage of SSD performance, making it into a performance critical part (like the GPU and the CPU). The documentation simply has not yet been updated to reflect this fact.
By the way I edited one of my earlier comments - my remark about there being a slight performance improvement due to fd-caching on centos7 even without io_uring was not correct.
Has anyone else seen this 3D classification variability on the new version? Has this kind of behavior change been observed before when changing versions? Can anyone repeat the same test with an old 3D class job? Perhaps it’s GPU-type related?
@vperetroukhin @olibclarke Thanks Oli for posting this. We’re experiencing the same issue: repeating the old job, which previously split roughly 50/50, now splits 95/5.
@vperetroukhin looked into this using my data and at least in the case I reported, it seemed to be just coincidence - class collapse was happening stochastically between runs in both 4.5.3 and 4.6. Other Class3D jobs have run normally in 4.6, so I don’t believe it is a widespread issue.
CryoSPARC v4.6.1, released today, contains small changes to the random seed usage so that when the random seed parameter is manually overridden, 2D/3D classification and refinement jobs now correctly use the provided value and produce consistent results when repeated, up to GPU floating-point determinism and precision.