Apologies for digging up this thread, but I have been experiencing this issue recently as well. I have access to multiple nodes on our GPU cluster, and sometimes the SSD cache gets stuck on specific nodes but not on others, even though the malfunctioning nodes may have worked yesterday or may work again tomorrow. The malfunctioning nodes are not busy either, as I always verify that no other jobs or programs are running on these nodes. Unlike what has been described already, though, there is no heartbeat in my log.
License is valid.
Launching job on lane default target node20.bbsrc ...
Running job on remote worker node hostname node20.bbsrc
[CPU: 90.8 MB Avail: 500.93 GB]
Job J292 Started
[CPU: 90.8 MB Avail: 500.93 GB]
Master running v4.6.2, worker running v4.6.2
[CPU: 91.1 MB Avail: 500.93 GB]
Working in directory: *
[CPU: 91.1 MB Avail: 500.93 GB]
Running on lane default
[CPU: 91.1 MB Avail: 500.93 GB]
Resources allocated:
[CPU: 91.1 MB Avail: 500.93 GB]
Worker: node20.bbsrc
[CPU: 91.1 MB Avail: 500.93 GB]
CPU : [0, 1]
[CPU: 91.1 MB Avail: 500.93 GB]
GPU : [0]
[CPU: 91.1 MB Avail: 500.93 GB]
RAM : [0]
[CPU: 91.1 MB Avail: 500.93 GB]
SSD : True
[CPU: 91.1 MB Avail: 500.92 GB]
--------------------------------------------------------------
[CPU: 91.1 MB Avail: 500.92 GB]
Importing job module for job type homo_abinit...
[CPU: 301.7 MB Avail: 500.77 GB]
Job ready to run
[CPU: 301.7 MB Avail: 500.77 GB]
***************************************************************
[CPU: 301.8 MB Avail: 500.77 GB]
Transparent hugepages are enabled. You may encounter stalls or performance problems with CryoSPARC jobs.
[CPU: 430.1 MB Avail: 500.65 GB]
Using random seed for sgd of 111987125
[CPU: 450.5 MB Avail: 500.63 GB]
Loading a ParticleStack with 220652 items...
[CPU: 450.5 MB Avail: 500.63 GB]
──────────────────────────────────────────────────────────────
SSD cache ACTIVE at * (10 GB reserve)
Checking and allocating files on SSD ...
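The event log above warns that transparent hugepages are enabled. For anyone comparing a working node against a stalling one, here is a minimal sketch for reading the active setting on a node. It assumes the standard Linux sysfs file `/sys/kernel/mm/transparent_hugepage/enabled`, whose content marks the active mode in square brackets. The names `active_thp_mode` and `read_thp_mode` are hypothetical helpers, not part of CryoSPARC.

```python
import re
from pathlib import Path

# Assumption: standard Linux sysfs location of the THP setting.
THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text):
    """Return the bracketed (active) mode from a string such as
    '[always] madvise never', or None if no bracketed mode is found."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_thp_mode(path=THP_PATH):
    """Read the setting from this host; return None if the file is unreadable."""
    try:
        return active_thp_mode(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("Active transparent hugepages mode:", read_thp_mode())
```

Running this on each node (for example over SSH) would show whether the nodes that stall differ from the ones that do not. It only reports the setting and does not change it.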
And here is my job log:
================= CRYOSPARCW ======= 2025-07-03 14:29:46.612205 =========
Project P12 Job J292
Master *
===========================================================================
MAIN PROCESS PID 3561285
========= now starting main process at 2025-07-03 14:29:46.612645
abinit.run cryosparc_compute.jobs.jobregister
MONITOR PROCESS PID 3561287
========= monitor process now waiting for main process
========= sending heartbeat at 2025-07-03 14:29:50.639151
***************************************************************
Transparent hugepages setting: [always] madvise never
Running job J292 of type homo_abinit
Running job on hostname %s node20.bbsrc
Allocated Resources : {'fixed': {'SSD': True}, 'hostname': 'node20.bbsrc', 'lane': 'default', 'lane_type': 'node', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1], 'GPU': [0], 'RAM': [0]}, 'target': {'cache_path': '*', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 23595909120, 'name': 'NVIDIA A10'}, {'id': 1, 'mem': 23595909120, 'name': 'NVIDIA A10'}, {'id': 2, 'mem': 23595909120, 'name': 'NVIDIA A10'}, {'id': 3, 'mem': 23595909120, 'name': 'NVIDIA A10'}], 'hostname': 'node20.bbsrc', 'lane': 'default', 'monitor_port': None, 'name': 'node20.bbsrc', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]}, 'ssh_str': '*@node20.bbsrc', 'title': 'Worker node node20.bbsrc', 'type': 'node', 'worker_bin_path': '*'}}
2025-07-03 14:29:59,105 run_with_executor INFO | Resolving 25 source path(s) for caching
2025-07-03 14:29:59,115 run_with_executor INFO | Resolved 25 sources in 0.01 seconds
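To check how many heartbeats a job log like the one above contains and when it last wrote anything, a small parser can help. This is a rough sketch that assumes only the two line formats visible in this excerpt: `sending heartbeat at <timestamp>` and lines that start with `YYYY-MM-DD HH:MM:SS,mmm`. The names `summarize_job_log` and `SAMPLE` are hypothetical and not part of CryoSPARC.

```python
import re
from datetime import datetime

# Formats taken from the log excerpt in this post; other CryoSPARC
# versions may print these lines differently (assumption).
HEARTBEAT_RE = re.compile(
    r"sending heartbeat at (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)"
)
LOGLINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) ")
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

SAMPLE = """\
========= monitor process now waiting for main process
========= sending heartbeat at 2025-07-03 14:29:50.639151
2025-07-03 14:29:59,105 run_with_executor INFO | Resolving 25 source path(s) for caching
2025-07-03 14:29:59,115 run_with_executor INFO | Resolved 25 sources in 0.01 seconds
"""


def summarize_job_log(text):
    """Count heartbeat lines and find the latest timestamp in a job log."""
    heartbeats = []
    timestamps = []
    for line in text.splitlines():
        hb = HEARTBEAT_RE.search(line)
        if hb:
            ts = datetime.strptime(hb.group(1), TS_FORMAT)
            heartbeats.append(ts)
            timestamps.append(ts)
            continue
        entry = LOGLINE_RE.match(line)
        if entry:
            # Convert '14:29:59,105' into '14:29:59.105' so %f can parse it.
            stamp = entry.group(1) + "." + entry.group(2)
            timestamps.append(datetime.strptime(stamp, TS_FORMAT))
    return {
        "heartbeat_count": len(heartbeats),
        "last_heartbeat": max(heartbeats) if heartbeats else None,
        "last_timestamp": max(timestamps) if timestamps else None,
    }


if __name__ == "__main__":
    print(summarize_job_log(SAMPLE))
```

Comparing `last_timestamp` with the current time gives a rough idea of how long the job has been silent, and running the same check on a log from a node where caching works makes it easier to see where the two diverge.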