Sometimes SSD cache gets stuck on specific nodes but not on others

Apologies for digging up this thread, but I have been experiencing this issue recently as well. I have access to multiple nodes on our GPU cluster, and sometimes the SSD cache gets stuck on specific nodes but not on others, even though a node that malfunctions today may have worked fine yesterday and may work fine again tomorrow. The malfunctioning nodes are not busy either, as I always verify that no other jobs or programs are running on these nodes. Unlike what has already been described, though, there is no heartbeat in my log.
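(For context, this is roughly the kind of check I mean by "verify"; these are generic commands, nothing CryoSPARC-specific:

# No compute processes should show up on any GPU
nvidia-smi
# No heavy CPU consumers beyond system daemons
top -b -n 1 | head -n 20
# No other interactive users logged in
who

)

The output from one of the stuck runs: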

License is valid.

Launching job on lane default target node20.bbsrc ...

Running job on remote worker node hostname node20.bbsrc

[CPU:   90.8 MB  Avail: 500.93 GB]
Job J292 Started

[CPU:   90.8 MB  Avail: 500.93 GB]
Master running v4.6.2, worker running v4.6.2

[CPU:   91.1 MB  Avail: 500.93 GB]
Working in directory: *

[CPU:   91.1 MB  Avail: 500.93 GB]
Running on lane default

[CPU:   91.1 MB  Avail: 500.93 GB]
Resources allocated: 

[CPU:   91.1 MB  Avail: 500.93 GB]
  Worker:  node20.bbsrc

[CPU:   91.1 MB  Avail: 500.93 GB]
  CPU   :  [0, 1]

[CPU:   91.1 MB  Avail: 500.93 GB]
  GPU   :  [0]

[CPU:   91.1 MB  Avail: 500.93 GB]
  RAM   :  [0]

[CPU:   91.1 MB  Avail: 500.93 GB]
  SSD   :  True

[CPU:   91.1 MB  Avail: 500.92 GB]
--------------------------------------------------------------

[CPU:   91.1 MB  Avail: 500.92 GB]
Importing job module for job type homo_abinit...

[CPU:  301.7 MB  Avail: 500.77 GB]
Job ready to run

[CPU:  301.7 MB  Avail: 500.77 GB]
***************************************************************

[CPU:  301.8 MB  Avail: 500.77 GB]
Transparent hugepages are enabled. You may encounter stalls or performance problems with CryoSPARC jobs.

[CPU:  430.1 MB  Avail: 500.65 GB]
Using random seed for sgd of 111987125

[CPU:  450.5 MB  Avail: 500.63 GB]
Loading a ParticleStack with 220652 items...

[CPU:  450.5 MB  Avail: 500.63 GB]
──────────────────────────────────────────────────────────────
SSD cache ACTIVE at * (10 GB reserve)
  Checking and allocating files on SSD ...

And here is the corresponding job log:

================= CRYOSPARCW =======  2025-07-03 14:29:46.612205  =========
Project P12 Job J292
Master *
===========================================================================
MAIN PROCESS PID 3561285
========= now starting main process at 2025-07-03 14:29:46.612645
abinit.run cryosparc_compute.jobs.jobregister
MONITOR PROCESS PID 3561287
========= monitor process now waiting for main process
========= sending heartbeat at 2025-07-03 14:29:50.639151
***************************************************************
Transparent hugepages setting: [always] madvise never

Running job  J292  of type  homo_abinit
Running job on hostname %s node20.bbsrc
Allocated Resources :  {'fixed': {'SSD': True}, 'hostname': 'node20.bbsrc', 'lane': 'default', 'lane_type': 'node', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1], 'GPU': [0], 'RAM': [0]}, 'target': {'cache_path': '*', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 23595909120, 'name': 'NVIDIA A10'}, {'id': 1, 'mem': 23595909120, 'name': 'NVIDIA A10'}, {'id': 2, 'mem': 23595909120, 'name': 'NVIDIA A10'}, {'id': 3, 'mem': 23595909120, 'name': 'NVIDIA A10'}], 'hostname': 'node20.bbsrc', 'lane': 'default', 'monitor_port': None, 'name': 'node20.bbsrc', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]}, 'ssh_str': '*@node20.bbsrc', 'title': 'Worker node node20.bbsrc', 'type': 'node', 'worker_bin_path': '*'}}
2025-07-03 14:29:59,105 run_with_executor    INFO     | Resolving 25 source path(s) for caching
2025-07-03 14:29:59,115 run_with_executor    INFO     | Resolved 25 sources in 0.01 seconds
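Side note: both logs warn that transparent hugepages are enabled. I have not changed that yet; as a minimal sketch (assuming root access on the worker), the current mode can be checked and switched to madvise with:

# Show the current THP mode; the value in [brackets] is the active one
cat /sys/kernel/mm/transparent_hugepage/enabled
# Switch to madvise until the next reboot
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled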

@Izana Given the long time that has passed and the many changes in CryoSPARC since the older topic was created, I moved your post to this new topic.
Is it a consistent subset of worker nodes where the cache gets stuck?
For an affected worker node, what is the output of the following command (after replacing /path/to/cache with the node’s actual cache path)?

df -hT /path/to/cache

Thank you for your speedy reply!

I don’t think the issue is limited to a particular subset of workers, as I have encountered this problem occasionally on all of the 4–6 nodes that I normally use. Or maybe I’m just unlucky.

The output of df is:

Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/sda1      xfs   3.5T  3.0T  559G  85% /cache

And I believe this 3.5 TB is allocated to me. I’m not running any other programmes that require SSD caching, and, as the nvidia-smi output below shows, no one else is running anything on the node either.
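In case it’s useful, a quick way to see what is actually occupying the cache drive (using /cache, the mount point from the df output above; adjust if yours differs):

# Size of everything on the cache drive, largest last
du -sh /cache/* 2>/dev/null | sort -h
# Most recently modified entries, to see whether anything is still being written
ls -lt /cache | head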

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10                     Off |   00000000:4B:00.0 Off |                    0 |
|  0%   34C    P8             15W /  150W |       5MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A10                     Off |   00000000:65:00.0 Off |                    0 |
|  0%   51C    P0             57W /  150W |    3690MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A10                     Off |   00000000:B1:00.0 Off |                    0 |
|  0%   33C    P8             15W /  150W |       3MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A10                     Off |   00000000:CA:00.0 Off |                    0 |
|  0%   35C    P8             15W /  150W |       3MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

Actually, it doesn’t look like my job is running either, haha.
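To double-check whether the job’s main process is doing anything at all, it can be inspected by PID; 3561285 is the MAIN PROCESS PID from the job.log above, and py-spy is a generic Python stack sampler installed separately (pip install py-spy), not part of CryoSPARC:

# Is the main process still alive?
ps -fp 3561285
# Dump its current Python stack (may need sudo depending on ptrace settings)
py-spy dump --pid 3561285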

Interestingly, when I tried to reproduce the issue just now, I noticed that the number of source paths had changed to a more reasonable number in the log:

2025-07-04 12:57:21,061 run_with_executor    INFO     | Resolving 17150 source path(s) for caching
========= sending heartbeat at 2025-07-04 12:57:21.644236
========= sending heartbeat at 2025-07-04 12:57:31.663001
2025-07-04 12:57:32,091 run_with_executor    INFO     | Resolved 17150 sources in 11.03 seconds
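While it sits at the caching step, a simple way to see whether data is actually reaching the cache drive (again assuming the /cache mount point from above; iostat comes from the sysstat package):

# Free space on the cache drive; if it never shrinks, nothing is being copied
watch -n 10 df -h /cache
# Per-device I/O statistics every 10 seconds
iostat -x 10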