Job is unresponsive - no heartbeat received

  1. The previous excerpt was from the event log for J134.
  2. That is correct. The process hung there until I finally sent a kill command at [Mon, 07 Oct 2024 21:50:09 GMT].
  3. See below.
ubuntu@ip-10-1-101-39:~$ cryosparcm cli "get_job('P57', 'J134', 'job_type', 'version', 'instance_information', 'params_spec', 'status', 'started_at')"
{'_id': '670184785a550d32e2931083', 'instance_information': {'CUDA_version': '11.8', 'available_memory': '115.46GB', 'cpu_model': 'AMD EPYC 7R32', 'driver_version': '12.2', 'gpu_info': [{'id': 0, 'mem': 23609475072, 'name': 'NVIDIA A10G', 'pcie': '0000:00:1e'}], 'ofd_hard_limit': 131072, 'ofd_soft_limit': 8192, 'physical_cores': 16, 'platform_architecture': 'x86_64', 'platform_node': 'gpu1-g5-dy-g5-8xlarge-1', 'platform_release': '5.15.0-1062-aws', 'platform_version': '#68~20.04.1-Ubuntu SMP Wed May 1 15:24:09 UTC 2024', 'total_memory': '124.46GB', 'used_memory': '7.98GB'}, 'job_type': 'class_2D_new', 'params_spec': {'class2D_K': {'value': 200}, 'class2D_num_full_iter_batch': {'value': 40}, 'class2D_num_full_iter_batchsize_per_class': {'value': 400}}, 'project_uid': 'P57', 'started_at': 'Mon, 07 Oct 2024 16:24:16 GMT', 'status': 'killed', 'uid': 'J134', 'version': 'v4.6.0'}

Another observation that might be useful: I was able to clone the job and have it finish successfully later in the day (with a different random seed). The clone is J142, and its get_job output is below. I wonder whether the sporadic errors we observe are caused by CryoSPARC jobs being queued on a pcluster compute node that is not freshly initialized at the time of job creation (e.g. a previous job finishes and that on-demand node is reused for a subsequent job, but some resources are not properly freed). That said, we routinely run multiple jobs sequentially on the same pcluster compute node without errors.

ubuntu@ip-10-1-101-39:~$ cryosparcm cli "get_job('P57', 'J142', 'job_type', 'version', 'instance_information', 'params_spec', 'status', 'started_at')"
{'_id': '670429385a550d32e25ebbe4', 'instance_information': {'CUDA_version': '11.8', 'available_memory': '115.53GB', 'cpu_model': 'AMD EPYC 7R32', 'driver_version': '12.2', 'gpu_info': [{'id': 0, 'mem': 23609475072, 'name': 'NVIDIA A10G', 'pcie': '0000:00:1e'}], 'ofd_hard_limit': 131072, 'ofd_soft_limit': 8192, 'physical_cores': 16, 'platform_architecture': 'x86_64', 'platform_node': 'gpu1-g5-dy-g5-8xlarge-2', 'platform_release': '5.15.0-1062-aws', 'platform_version': '#68~20.04.1-Ubuntu SMP Wed May 1 15:24:09 UTC 2024', 'total_memory': '124.46GB', 'used_memory': '7.92GB'}, 'job_type': 'class_2D_new', 'params_spec': {'class2D_K': {'value': 200}, 'class2D_num_full_iter_batch': {'value': 40}, 'class2D_num_full_iter_batchsize_per_class': {'value': 400}}, 'project_uid': 'P57', 'started_at': 'Mon, 07 Oct 2024 18:32:33 GMT', 'status': 'completed', 'uid': 'J142', 'version': 'v4.6.0'}
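
For comparing the two outputs field by field, a minimal Python sketch is below. It assumes the get_job output is pasted as text and parsed with ast.literal_eval; the helper name diff_dicts is hypothetical and not part of CryoSPARC, and the two dictionaries are trimmed copies of the outputs above with only a few fields kept for brevity. Among the hardware and parameter fields of the full outputs, the one that differs is platform_node (gpu1-g5-dy-g5-8xlarge-1 for J134, gpu1-g5-dy-g5-8xlarge-2 for J142), so the two jobs ran on different compute nodes.

```python
import ast


def diff_dicts(a, b, prefix=""):
    """Return (path, value_in_a, value_in_b) for every leaf that differs.

    Hypothetical helper for illustration; nested dicts are compared
    recursively and keys are visited in sorted order.
    """
    diffs = []
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}.{key}" if prefix else str(key)
        va, vb = a.get(key), b.get(key)
        if isinstance(va, dict) and isinstance(vb, dict):
            diffs.extend(diff_dicts(va, vb, path))
        elif va != vb:
            diffs.append((path, va, vb))
    return diffs


# Trimmed copies of the two get_job outputs (values taken from the post).
j134_text = (
    "{'instance_information': {'driver_version': '12.2', "
    "'platform_node': 'gpu1-g5-dy-g5-8xlarge-1', 'used_memory': '7.98GB'}, "
    "'params_spec': {'class2D_K': {'value': 200}}, "
    "'status': 'killed', 'uid': 'J134'}"
)
j142_text = (
    "{'instance_information': {'driver_version': '12.2', "
    "'platform_node': 'gpu1-g5-dy-g5-8xlarge-2', 'used_memory': '7.92GB'}, "
    "'params_spec': {'class2D_K': {'value': 200}}, "
    "'status': 'completed', 'uid': 'J142'}"
)

# literal_eval safely parses the dict repr without executing code.
j134 = ast.literal_eval(j134_text)
j142 = ast.literal_eval(j142_text)

for path, hung, ok in diff_dicts(j134, j142):
    print(f"{path}: {hung!r} -> {ok!r}")
```

The same helper can be run on the complete outputs by pasting them in place of the trimmed strings; fields such as _id, started_at and uid will always differ between jobs and can be ignored.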

Thanks again for helping us troubleshoot this!
Evan