SSD cache in cluster environment

Hi -
Is there any documentation explaining how the SSD cache works in the HPC cluster environment?
We started getting really weird issues recently and I am not sure whether this is related to the CryoSPARC update (we are running 4.4.1 now) or something else.

The setup is pretty simple: several nodes with local SSD cache under the SLURM scheduler. Even with plenty of space on the SSD storage, when multiple jobs land on the same worker we see weird things:

  1. We see really weird numbers for available space:

 SSD cache : cache successfully synced in_use
 SSD cache : cache successfully synced, found 834,778.96 MB of files on SSD.
 SSD cache : cache successfully requested to check 2927 files.
 SSD cache : cache requires 22,050 MB more on the SSD for files to be downloaded.
 SSD cache : cache may not have enough space for download
  Needed        |    22,050.01 MB
  **Available     | -3,399,795.39 MB**
  Disk size     | 7,323,354.00 MB
  Usable space  | 7,000,000.00 MB  (reserve 100,000 MB) (quota 7,000,000 MB)

As you can see, the disk size is 7 TB and about 800 GB of files were found on the SSD, yet the available space is … negative (a rough check of that number follows below).

  2. CryoSPARC tries to delete some files off the SSD cache. I am not sure where it gets the information about which files should be deleted, but the files are not there, and we get many lines of
 Could not delete non-existing file /SSD/.... 
  3. With 7 jobs on the worker in “running” state, I see only 2 actually doing anything. The rest are stuck on the SSD cache one way or another for many hours. They are jobs from different projects and different users (from the CryoSPARC perspective), yet they do not progress, trying to delete non-existing files or waiting for cache files to be unlocked, etc…

Eventually it all goes through, with no failed jobs. But it is disappointing to see jobs waiting on the SSD cache when the SSDs have plenty of space and are not being used properly.
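For what it is worth, here is a rough check of that negative number. My guess (only a guess, not taken from the code) is that “Available” is roughly the usable space minus whatever the cache database currently counts as allocated or locked; if so, the database would have to be tracking far more than the quota:

    # Back-of-the-envelope check, all values in MB, taken from the log above.
    # Assumption (not verified against the code): Available = usable space - tracked allocations.
    usable=7000000       # "Usable space" / quota from the log
    available=-3399795   # "Available" from the log
    found=834779         # files actually found on the SSD

    echo $((usable - available))   # implied tracked allocations: 10399795 MB (~10.4 TB)
    echo "$found"                  # actually on disk:              834779 MB (~0.8 TB)

If that guess is right, it would point at stale or duplicated allocation records in the cache database rather than at anything actually sitting on the disks.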

Thanks for any information on the matter.

Have you asked your HPC support to investigate? Does anything else (e.g. RELION) have issues when caching (scratch) is enabled?

I am the HPC support :slight_smile:
Other programs don’t have issues, and CryoSPARC didn’t have this issue before either. It is something new in the way CryoSPARC manages its caches.

I see. I know the feeling. :wink:

Revert to the previous CryoSPARC version and see whether it still happens?

Because negative space makes me worry about filesystem corruption.

Hi @filonovd,

IIRC the latest updates to CryoSPARC slightly changed how the caching system works with respect to file locking, etc.
In previous versions the caching system could be toggled with a switch. That, however, required completely deleting the cache folder (instance_master.example.com:portnumber) on each node to avoid issues with files cached by the previous system.

You could try deleting the cache folder entirely on every node (something along the lines of the sketch below), and then retry the jobs? Good luck!
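A minimal, untested sketch, assuming the SSD cache root is /SSD on every node (as in the error messages above); the node names and the master hostname/port in the folder name are placeholders you would substitute for your own:

    # Untested sketch -- make sure no jobs are using the cache before running this.
    # Folder name pattern as described above: instance_<master hostname>:<port>.
    CACHE_DIR='/SSD/instance_master.example.com:portnumber'   # substitute your own
    for node in node01 node02 node03; do                      # hypothetical node names
        ssh "$node" "rm -rf '$CACHE_DIR'"
    done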

Filesystem corruption happening at the same time on several HPC nodes’ local storage? Not likely.
One of the nodes was re-imaged last week with new SSD cache disks, so its cache is freshly initialized, but it behaves exactly the same.

Andrea -
I can do that. But as you can see above, one of the nodes got completely new SSD cache storage and still has the same issue. Unfortunately, testing is not easy either: it only happens when multiple jobs from different projects hit the same worker at about the same time.

One thing I want to stress: jobs get stuck for hours, but not forever. Eventually they do run, once some other jobs finish. So it is not as if a job hits some filesystem corruption and cannot continue at all. It is the CryoSPARC master process telling the job not to proceed; eventually that clears, and the worker node gets the job done.

So I just stopped CryoSPARC, emptied all the caches (both the database records and the files in the caches on all cluster nodes), and started it back up again…
One node gets a job and starts caching files. That is fine.
Another node on the cluster (!) gets another job from the same project. And …
it sees some files locked in the cache, while the cache on that node is completely empty.
So somehow it thinks the cache files are locked by the other node on the cluster, even though these caches are totally independent.
I do understand that the new CryoSPARC has some code to manage shared caches, but we don’t have a shared cache. It is local to each node.

@filonovd The logs you first posted suggest that your CryoSPARC workers are not yet using the cache system’s new implementation. You may want to try:

  1. Ensure no jobs using cache are running (maintenance mode may help with this).
  2. Empty workers’ CryoSPARC particle caches.
  3. Inside all the CryoSPARC instance’s cryosparc_worker/config.sh or cryosparc2_worker/config.sh files, include the line
    export CRYOSPARC_IMPROVED_SSD_CACHE=true
    
  4. Resume CryoSPARC processing.
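For step 3, something along these lines may help; the path below is only a placeholder for your cryosparc_worker directory, and with per-node worker installs it would need to be repeated on every node. It appends the line only if it is not already present:

    # Sketch for step 3 -- adjust WORKER_CONFIG to your actual install location.
    WORKER_CONFIG=/path/to/cryosparc_worker/config.sh   # hypothetical path

    grep -q CRYOSPARC_IMPROVED_SSD_CACHE "$WORKER_CONFIG" || \
        echo 'export CRYOSPARC_IMPROVED_SSD_CACHE=true' >> "$WORKER_CONFIG"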

Thank you, Wolfram. Will try that.