Hi,
I have similar problems with 2D classification jobs. They have happened very often since I upgraded to v4, and updating to the latest version (4.1.1) did not seem to solve the problem. Here is the job.log file for one of the failed 2D classification jobs:
================= CRYOSPARCW ======= 2022-12-26 04:26:31.452796 =========
Project P4 Job J9
Master cryows1 Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 174865
MAIN PID 174865
class2D.run cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process
========= sending heartbeat
========= sending heartbeat
***************************************************************
Running job J9 of type class_2D
Running job on hostname %s cryows1
Allocated Resources : {'fixed': {'SSD': True}, 'hostname': 'cryows1', 'lane': 'default', 'lane_type': 'node', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1], 'GPU': [0], 'RAM': [0, 1, 2]}, 'target': {'cache_path': '/home/cryosparc_user/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 25431441408, 'name': 'NVIDIA RTX A5500'}, {'id': 1, 'mem': 25434587136, 'name': 'NVIDIA RTX A5500'}, {'id': 2, 'mem': 25434587136, 'name': 'NVIDIA RTX A5500'}, {'id': 3, 'mem': 25434587136, 'name': 'NVIDIA RTX A5500'}], 'hostname': 'cryows1', 'lane': 'default', 'monitor_port': None, 'name': 'cryows1', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparc_user@cryows1', 'title': 'Worker node cryows1', 'type': 'node', 'worker_bin_path': '/home/cryosparc_user/Applications/cryosparc/cryosparc_worker/bin/cryosparcw'}}
HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
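For comparing logs like the one above, a minimal Python sketch (a hypothetical helper, not part of cryoSPARC) could count the heartbeat lines and report the last line that is not a heartbeat, i.e. the last thing the main process logged before it went quiet. It assumes the heartbeat line text is exactly as it appears in this log; the name `summarize_job_log` is only illustrative.

```python
# Hypothetical helper, not part of cryoSPARC: summarize a job.log excerpt.
# Assumption: heartbeat lines read exactly "========= sending heartbeat".
HEARTBEAT = "========= sending heartbeat"


def summarize_job_log(lines):
    """Return (total_heartbeats, trailing_heartbeats, last_other_line).

    trailing_heartbeats counts the heartbeats after the last
    non-heartbeat line, i.e. how long the log shows nothing else.
    """
    total = 0
    trailing = 0
    last_other = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line == HEARTBEAT:
            total += 1
            trailing += 1
        else:
            trailing = 0  # something else was logged, reset the run
            last_other = line
    return total, trailing, last_other


# Small excerpt in the same shape as the log above.
sample = [
    "========= monitor process now waiting for main process",
    "========= sending heartbeat",
    "HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty",
    "========= sending heartbeat",
    "========= sending heartbeat",
]
total, trailing, last_other = summarize_job_log(sample)
print(f"heartbeats: {total}, trailing: {trailing}, last other line: {last_other}")
```

To run it on a real file, the function can be given an open file object, for example `summarize_job_log(open("job.log"))`, since it only iterates over lines.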
It seems this "Job unresponsive" error is more likely to happen when: i) a higher number of 2D classes (400 or more) is used; ii) a higher number of online-EM iterations (40 or 60) is used; iii) a larger batch size (e.g. 400) is used; iv) more than one GPU is used for a 2D classification job; or v) more than one 2D classification job is running at the same time.
Any suggestions or fixes would be much appreciated.
Thanks.