OSError: [Errno 28] No space left on device

Hello, I am a cluster administrator. Our CryoSPARC service has multiple users, and our cluster has both single-GPU and multi-GPU NVIDIA nodes. "OSError: [Errno 28] No space left on device" is frequently reported on the multi-GPU nodes when users enable the SSD cache.
I suspect this happens because, when task A frees 1 TB of space, task B starts, sees that the SSD cache has enough free space, and begins copying at the same time as task A, so both jobs fill the same space. Is there any way to solve this problem? Thank you very much!
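
To make the suspected race concrete, here is a minimal toy sketch in Python. It is not CryoSPARC's caching code; the `Cache` class, `racy_run`, and the sizes are hypothetical names and numbers chosen only to show how two jobs that both check free space before either one copies can overrun the disk, and how checking and claiming space under one lock avoids it.

```python
import errno
import threading


class Cache:
    """Toy model of a shared SSD cache (hypothetical; sizes in GB)."""

    def __init__(self, free_gb):
        self.free_gb = free_gb
        self.lock = threading.Lock()

    def has_room(self, size_gb):
        # The "check" step: is there enough free space right now?
        return self.free_gb >= size_gb

    def copy(self, size_gb):
        # The "copy" step: fails like a full disk if space ran out.
        if size_gb > self.free_gb:
            raise OSError(errno.ENOSPC, "No space left on device")
        self.free_gb -= size_gb

    def reserve_and_copy(self, size_gb):
        # Check and claim space as one step, so no other job can
        # slip in between the check and the copy.
        with self.lock:
            if not self.has_room(size_gb):
                return False
            self.copy(size_gb)
            return True


def racy_run(cache, sizes_gb):
    """Every job checks first, then every job copies (the suspected interleaving)."""
    checks = [cache.has_room(s) for s in sizes_gb]
    results = []
    for size, passed in zip(sizes_gb, checks):
        if not passed:
            results.append("waited")
            continue
        try:
            cache.copy(size)
            results.append("copied")
        except OSError as exc:
            results.append(f"failed: errno {exc.errno}")
    return results


if __name__ == "__main__":
    # Two 1000 GB jobs both see 1000 GB free; the second copy fails.
    print(racy_run(Cache(1000), [1000, 1000]))
    # With check-and-claim under a lock, the second job is told to wait.
    safe = Cache(1000)
    print([safe.reserve_and_copy(s) for s in (1000, 1000)])
```

In the racy run the second job passes its check but fails during the copy, which matches the symptom described above; in the locked version it is refused up front instead.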

Hi @zhenyuanliu, to help me with further troubleshooting could you tell me the following?

  • What does your worker/cache setup look like? Does each GPU node have a dedicated cache, or is a shared cache used between two GPUs?
  • What’s the total size of the SSD cache?
  • How many jobs that use the cache do you have running at the same time?

Also, try increasing the SSD cache reserve (instructions here; use the --ssdreserve and --update flags) from the default 10 GB to 50 or 100 GB. Does that fix the issue?

@nfrasser Hello, my worker nodes have no special cache settings; only the cache path is specified. Each worker node has 5.8 TB of cache space. In most cases, four jobs that use the cache run at the same time, because each worker node has four GPUs.
What value do you recommend if I use the --ssdreserve parameter? The largest jobs I’ve seen so far require 1 TB of cache space.
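
As a rough sanity check on these numbers, the worst case can be worked out with a few lines of Python. This is only arithmetic under stated assumptions: 5.8 TB is treated as 5800 GB, all four jobs are assumed to need the full 1 TB at once, the cache is assumed to hold nothing else, and the 100 GB reserve is the upper value suggested above. The function name is hypothetical.

```python
def worst_case_headroom_gb(cache_gb, jobs, per_job_gb, reserve_gb):
    """Free space left if every job caches its full dataset at once.

    Assumes the cache starts empty; a negative result means the jobs
    cannot all fit at the same time.
    """
    return cache_gb - reserve_gb - jobs * per_job_gb


if __name__ == "__main__":
    # 5.8 TB cache, four concurrent jobs of 1 TB each, 100 GB reserve.
    print(worst_case_headroom_gb(5800, 4, 1000, 100))
```

A positive result here would suggest that raw capacity is not the limit on its own, and that data already in the cache or the timing of concurrent copies matters more.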