Hard drive setup for a single workstation

I’m at an institution with very little cryo EM experience. A single workstation was bought but never set up (only Linux was installed). As an eager student with cryo EM data and not much time left, here I am.

We have a 1TB NVMe which was set up as the boot drive, and we have a 4TB too. Is it sensible to allocate this purely as the SSD cache?

We also have 6x10TB hard disks that haven’t been mounted yet. I’ve seen that these should be in a RAID configuration. Would you recommend RAID 5, 6 or 10?

Without knowing further details regarding your exact hardware, the following recommendations are based on a system running Ubuntu* (LTS) and not equipped with a RAID controller card** (motherboard BIOS/software RAID is not recommended):

  • Keep NVMe as root/home
  • Use ZFS to set up the 4TB SSD as a “fast” drive, and point cryoSPARC at a directory on it for caching
  • Use ZFS to set up the HDDs as a “slow” drive, in either RAIDZ1 (equivalent to RAID5) if using 3 or 4 drives, or RAIDZ2 (equivalent to RAID6) if using all six. Be aware that the ZFS ARC will merrily swallow up to half your RAM, but will free it as necessary.
  • Backups, backups, backups!
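
On the ARC point in the list above: if the default ceiling turns out to be too greedy, OpenZFS on Linux lets you cap it with the `zfs_arc_max` module parameter, given in bytes. Here is a minimal Python sketch that just computes the value and prints the matching modprobe line. The 64 GiB cap and the idea of putting the line in /etc/modprobe.d/zfs.conf are illustrative assumptions, not something tuned for this workstation.

```python
GIB = 1024 ** 3  # bytes in one GiB


def arc_max_bytes(cap_gib: float) -> int:
    """Convert an ARC cap given in GiB to the byte count zfs_arc_max expects."""
    if cap_gib <= 0:
        raise ValueError("cap_gib must be positive")
    return int(cap_gib * GIB)


def modprobe_line(cap_gib: float) -> str:
    """Build the 'options' line that sets zfs_arc_max when the zfs module loads."""
    return f"options zfs zfs_arc_max={arc_max_bytes(cap_gib)}"


if __name__ == "__main__":
    # Hypothetical choice: cap the ARC at 64 GiB on a 256 GB machine.
    print(modprobe_line(64))
```

The printed line would go in a modprobe configuration file and take effect the next time the module loads. Leaving the default alone is also perfectly reasonable, since the ARC gives memory back under pressure.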

*Your institute may have specific OS policies, so you may not have a choice. But I’ve had a lot of headaches with RedHat and its derivatives over the years, so I generally avoid them unless given no choice.

**If the system is equipped with a hardware RAID controller (Adaptec or similar), then set the HDD RAID up in that, but make sure to install the management and monitoring software for troubleshooting.

If working with larger datasets, 4TB won’t go very far, but unless you’re throwing around datasets of 10,000-plus micrographs, you don’t actually need to go too crazy. I’ve still got a Ryzen 3900X, 128GB, dual 1080Ti system in a corner which is actually still pretty good, provided you don’t ask it to do huge boxes or 3D Flex. :smiley:
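
To put rough numbers on how far 4TB goes, here is a back-of-envelope Python sketch of extracted particle stack size versus cache size. It assumes particles are stored as 32-bit floats and ignores file headers. The particles-per-micrograph count, the box sizes and the 80% headroom factor are all made-up illustrative values, so plug in your own.

```python
def particle_stack_bytes(n_micrographs: int,
                         particles_per_micrograph: int,
                         box_px: int,
                         bytes_per_pixel: int = 4) -> int:
    """Approximate size of an extracted particle stack (headers ignored)."""
    n_particles = n_micrographs * particles_per_micrograph
    return n_particles * box_px * box_px * bytes_per_pixel


def fits_in_cache(stack_bytes: int,
                  cache_tb: float = 4.0,
                  headroom: float = 0.8) -> bool:
    """True if the stack fits in the cache while leaving some space free."""
    return stack_bytes <= cache_tb * 1e12 * headroom


if __name__ == "__main__":
    # Hypothetical dataset: 10,000 micrographs, 300 picks per micrograph.
    for box in (256, 512, 640):
        size = particle_stack_bytes(10_000, 300, box)
        print(f"box {box}px: {size / 1e12:.2f} TB, fits: {fits_in_cache(size)}")
```

The point is that the stack grows with the square of the box size, so it is large boxes rather than micrograph counts alone that eat the cache.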

Thank you!
SYSTEM SPECS:
Processor: Intel® Xeon(R) w9-3495X (56 cores, 112 threads, 1.9GHz base, 4.8GHz turbo )
Memory: 256.0 GB
GPU: 2X NVIDIA GA102GL [RTX A6000] (49GB memory each)
OS: Ubuntu 22.04.5 LTS
Storage: 1x 1TB NVMe, 1x 4TB NVMe, 6x10TB hard disk
RAID bus controller: Intel C600/X79 series chipset SATA RAID Controller (rev 11)

On further googling, it seems this RAID is actually software rather than hardware, but is it still considered a RAID controller card? I will have a look in the BIOS to see if there is a management interface there.

ZFS: I’ve seen some stuff about XFS having faster read I/O. I haven’t found documentation on the recommended filesystem for CryoSPARC. Is ZFS just much more reliable without much of a performance tradeoff (assuming I have sufficient RAM)?

Is there a drawback to using all 6 drives in RAIDZ2? Or would you use less and keep the others for something like backups?

Backups: do you mean backups of movies or the CryoSPARC projects (or both), and do you mean on the workstation or off? I feel like 60-70TB is not actually too much to play with.

Our expected dataset size is apparently 7000 movies.

Correct, the chipset RAID is Intel software RAID (fake RAID). Don’t use it, simply because Debian-based distros (at least in my experience) do not “do” Intel RAID cleanly at all; RedHat/Arch/Gentoo do, for some reason.

ZFS is just a lot easier to set up and manage, and the filesystem has been quite robust and trouble-free (at least in my experience since I started deploying it with Ubuntu 20.04). One recommendation, though: assemble the ZFS array using the names in /dev/disk/by-id/ rather than /dev/sd*, as those identifiers are tied to the drive itself and do not change if the sd* assignments get shuffled around for some reason (e.g. a new drive is added or a drive fails). It also (apparently) means that the ZFS array can be physically transferred to another system and imported more easily, but I’ve never actually had to try that yet (thankfully). Motherboard (software) RAID dies with the motherboard, if for some reason that needs to be replaced.
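
To illustrate the /dev/disk/by-id/ suggestion, here is a small Python sketch that lists whole-disk entries and assembles (but does not run) a `zpool create` command from them. The pool name `slow`, the `ata-` prefix filter and the device names in the demo are assumptions for illustration. Check what your drives are actually called in /dev/disk/by-id/ before using anything like this.

```python
import os
import shlex

PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


def whole_disk_ids(by_id_dir="/dev/disk/by-id", prefix="ata-"):
    """Return sorted by-id paths for whole disks.

    Partition links end in '-partN', so they are skipped. Filtering on one
    prefix avoids listing the same disk twice under a second naming scheme.
    """
    names = sorted(
        name for name in os.listdir(by_id_dir)
        if name.startswith(prefix) and "-part" not in name
    )
    return [os.path.join(by_id_dir, name) for name in names]


def zpool_create_cmd(pool, vdev_type, devices):
    """Build the argument list for 'zpool create' without executing it."""
    if vdev_type not in PARITY:
        raise ValueError(f"unsupported vdev type: {vdev_type}")
    if len(devices) <= PARITY[vdev_type]:
        raise ValueError("need more drives than parity devices")
    return ["zpool", "create", pool, vdev_type, *devices]


if __name__ == "__main__":
    # Hypothetical device names; real ones include the model and serial number.
    demo = [f"/dev/disk/by-id/ata-EXAMPLE_DISK_{i}" for i in range(1, 5)]
    print(shlex.join(zpool_create_cmd("slow", "raidz1", demo)))
```

Printing the command rather than running it is deliberate: creating a pool wipes the drives, so it is worth reading the device list by eye first.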

RAIDZ2 can be a bit slow (as can RAIDZ3) because of all the parity calculation overhead when writing data. Z3 is painful, even on fast drives.

If purchasing additional storage for backup is not an option, RAIDZ1 with 4 drives and keeping two for offline archival would probably be the simplest strategy.
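
For a feel of what each layout gives you, here is a quick Python sketch of the nominal capacity arithmetic: a RAIDZ vdev loses roughly one drive's worth of space per parity level. It ignores ZFS metadata and padding overhead and the TB versus TiB difference, so real usable space will be somewhat lower.

```python
def raidz_usable_tb(n_drives: int, drive_tb: float, parity: int) -> float:
    """Nominal usable capacity of one RAIDZ vdev, before filesystem overhead."""
    if parity not in (1, 2, 3):
        raise ValueError("parity must be 1, 2 or 3")
    if n_drives <= parity:
        raise ValueError("need more drives than parity devices")
    return (n_drives - parity) * drive_tb


if __name__ == "__main__":
    drive_tb = 10  # the 10TB drives from this thread
    # Option A: 4 drives in RAIDZ1, 2 drives kept aside for offline archival.
    print("RAIDZ1 x4:", raidz_usable_tb(4, drive_tb, 1), "TB online,",
          2 * drive_tb, "TB raw kept offline")
    # Option B: all 6 drives in RAIDZ2, nothing kept aside.
    print("RAIDZ2 x6:", raidz_usable_tb(6, drive_tb, 2), "TB online")
```

So the choice is roughly 30TB online plus 20TB of offline archive, versus 40TB online that survives two drive failures but has no separate copy.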

The RAM seems oddly small for the CPU and GPUs, but I’ve seen that a lot with pre-configured systems for machine learning: lots of cores, big GPUs, meagre RAM provision. 256GB will be fine for most things, but larger boxes, extremely high class counts in 3D or high-res 3D Flex might cause problems.

Hi sm,
You can install the latest Ubuntu 24 and install CS in your root or home (if you are the only user). Your configuration is OK. Just point the CS scratch path at the 4TB NVMe.
You have a sufficient amount of storage. You can store the movies on one drive and put the cryoSPARC processing on another. RAID is complex, and everything works smoothly without it. But if there are multiple users, then it needs to be set up.
Before installing CS, just install the minimum prerequisites it needs to install and work smoothly.
After installing, check the CS install from the command line to confirm everything is fine.
Let me know if you have any questions.
Best