Job is unresponsive - no heartbeat received

Hello,

I am hopping onto this troubleshooting thread because we are experiencing the same thing. It seems to have started after the latest update. It happens on a variety of jobs, but usually the longer-running ones such as NU Refinement, 2D Classification, or Reference-Based Motion Correction. Below are the outputs you asked for, from a Reference-Based Motion Correction job that failed yesterday:

[Tue, 01 Oct 2024 16:09:40 GMT]  License is valid.
[Tue, 01 Oct 2024 16:09:40 GMT]  Launching job on lane gpu1-g5-525 target gpu1-g5-525 ...
[Tue, 01 Oct 2024 16:09:40 GMT]  Launching job on cluster gpu1-g5-525
[Tue, 01 Oct 2024 16:09:40 GMT]  
====================== Cluster submission script: ========================
==========================================================================
#!/bin/bash
#SBATCH --job-name=cryosparc_P56_J432
#SBATCH --output=/fsx/cryoem-processing/XXXX/CS-XXXX-sept/J432/job.log
#SBATCH --error=/fsx/cryoem-processing/XXXX/CS-XXXX-sept/J432/job.log
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu1-g5-525
sudo sed -i '1s/^/10.99.168.213 ip-10-99-168-213\n /' /etc/hosts
/shared/cryosparc/cryosparc_worker/bin/cryosparcw run --project P56 --job J432 --master_hostname 10.99.168.213 --master_command_core_port 45002 > /fsx/cryoem-processing/XXXX/CS-XXXX-sept/J432/job.log 2>&1 


==========================================================================
==========================================================================
[Tue, 01 Oct 2024 16:09:40 GMT]  -------- Submission command: 
sbatch /fsx/cryoem-processing/XXXX/CS-XXXX-sept/J432/queue_sub_script.sh
[Tue, 01 Oct 2024 16:09:40 GMT]  -------- Cluster Job ID: 
9602
[Tue, 01 Oct 2024 16:09:40 GMT]  -------- Queued on cluster at 2024-10-01 16:09:40.689028
[Tue, 01 Oct 2024 16:09:41 GMT]  -------- Cluster job status at 2024-10-01 16:13:54.826631 (25 retries)
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              9602 gpu1-g5-5 cryospar   ubuntu  R       0:04      1 gpu1-g5-525-dy-g5-8xlarge-1
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 90 MB] Job J432 Started
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 90 MB] Master running v4.6.0, worker running v4.6.0
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 91 MB] Working in directory: /fsx/cryoem-processing/XXXX/CS-XXXX-sept/J432
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 91 MB] Running on lane gpu1-g5-525
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 91 MB] Resources allocated:
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 91 MB]   Worker:  gpu1-g5-525
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 91 MB]   CPU   :  [0, 1, 2, 3, 4, 5]
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 91 MB]   GPU   :  [0]
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 91 MB]   RAM   :  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 91 MB]   SSD   :  False
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 91 MB] --------------------------------------------------------------
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 91 MB] Importing job module for job type reference_motion_correction...
S9/import_movies/FoilHole_21559105_Data_21557734_0_20240920_181231_EER.eer
[Tue, 01 Oct 2024 23:46:41 GMT] [CPU RAM used: 6947 MB] Plotting trajectories and particles for movie 8604535981330866848
S9/import_movies/FoilHole_21559105_Data_21557737_0_20240920_181239_EER.eer
[Tue, 01 Oct 2024 23:46:45 GMT] [CPU RAM used: 7175 MB] Plotting trajectories and particles for movie 18252624766583027495
S9/import_movies/FoilHole_21559105_Data_21557740_0_20240920_181250_EER.eer
[Tue, 01 Oct 2024 23:47:16 GMT] [CPU RAM used: 16764 MB] Plotting trajectories and particles for movie 9936234382313998456
S9/import_movies/FoilHole_21559115_Data_21557734_0_20240920_181409_EER.eer
[Tue, 01 Oct 2024 23:47:21 GMT] [CPU RAM used: 7735 MB] Plotting trajectories and particles for movie 16132868517269104694
S9/import_movies/FoilHole_21559115_Data_21557737_0_20240920_181418_EER.eer
[Tue, 01 Oct 2024 23:47:49 GMT] [CPU RAM used: 15790 MB] Plotting trajectories and particles for movie 2546108472353545219
S9/import_movies/FoilHole_21559234_Data_21557734_39_20240920_181434_EER.eer
[Tue, 01 Oct 2024 23:47:54 GMT] [CPU RAM used: 8698 MB] Plotting trajectories and particles for movie 3156738655360191867
S9/import_movies/FoilHole_21559115_Data_21557740_0_20240920_181426_EER.eer
[Tue, 01 Oct 2024 23:48:31 GMT] [CPU RAM used: 7524 MB] Plotting trajectories and particles for movie 734182889653974394
S9/import_movies/FoilHole_21559234_Data_21557737_39_20240920_181442_EER.eer
[Tue, 01 Oct 2024 23:48:35 GMT] [CPU RAM used: 12193 MB] Plotting trajectories and particles for movie 3129815169519156607
S9/import_movies/FoilHole_21559234_Data_21557740_39_20240920_181451_EER.eer
[Tue, 01 Oct 2024 23:48:38 GMT] [CPU RAM used: 16554 MB] No further example plots will be made, but the job is still running (see progress bar above).
[Wed, 02 Oct 2024 10:14:33 GMT]  **** Kill signal sent by CryoSPARC (ID: <Heartbeat Monitor>) ****
[Wed, 02 Oct 2024 10:18:24 GMT]  Job is unresponsive - no heartbeat received in 240 seconds.
================= CRYOSPARCW =======  2024-10-01 16:13:57.560391  =========
Project P56 Job J432
Master 10.99.168.213 Port 45002
===========================================================================
MAIN PROCESS PID 8222
========= now starting main process at 2024-10-01 16:13:57.560919
motioncorrection.run_reference_motion cryosparc_compute.jobs.jobregister
/shared/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/shared/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
/shared/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/shared/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
MONITOR PROCESS PID 8224
========= sending heartbeat at 2024-10-02 10:04:22.243328
========= sending heartbeat at 2024-10-02 10:04:32.257703
========= sending heartbeat at 2024-10-02 10:04:42.272811
========= sending heartbeat at 2024-10-02 10:04:52.287329
========= sending heartbeat at 2024-10-02 10:05:02.301768
========= sending heartbeat at 2024-10-02 10:05:12.316043
========= sending heartbeat at 2024-10-02 10:05:22.330617
========= sending heartbeat at 2024-10-02 10:05:32.345355
========= sending heartbeat at 2024-10-02 10:05:42.360251
========= sending heartbeat at 2024-10-02 10:05:52.374642
========= sending heartbeat at 2024-10-02 10:06:02.388936
========= sending heartbeat at 2024-10-02 10:06:12.403730
========= sending heartbeat at 2024-10-02 10:06:22.418664
========= sending heartbeat at 2024-10-02 10:06:32.433591
========= sending heartbeat at 2024-10-02 10:06:42.448117
========= sending heartbeat at 2024-10-02 10:06:52.462729
========= sending heartbeat at 2024-10-02 10:07:02.477154
========= sending heartbeat at 2024-10-02 10:07:12.488272
========= sending heartbeat at 2024-10-02 10:07:22.502594
========= sending heartbeat at 2024-10-02 10:07:32.516948
========= sending heartbeat at 2024-10-02 10:07:42.531177
========= sending heartbeat at 2024-10-02 10:07:52.545868
========= sending heartbeat at 2024-10-02 10:08:02.560649
========= sending heartbeat at 2024-10-02 10:08:12.575232
========= sending heartbeat at 2024-10-02 10:08:22.590055
========= sending heartbeat at 2024-10-02 10:08:32.604408
========= sending heartbeat at 2024-10-02 10:08:42.618858
========= sending heartbeat at 2024-10-02 10:08:52.633336
========= sending heartbeat at 2024-10-02 10:09:02.647757
========= sending heartbeat at 2024-10-02 10:09:12.662022
========= sending heartbeat at 2024-10-02 10:09:22.681091
========= sending heartbeat at 2024-10-02 10:09:32.695557
========= sending heartbeat at 2024-10-02 10:09:42.710385
========= sending heartbeat at 2024-10-02 10:09:52.724873
========= sending heartbeat at 2024-10-02 10:10:02.739192
========= sending heartbeat at 2024-10-02 10:10:12.753804
========= sending heartbeat at 2024-10-02 10:10:22.768007
========= sending heartbeat at 2024-10-02 10:10:32.780593
<string>:1: UserWarning: *** CommandClient: command (http://10.99.168.213:45002/api) did not reply within timeout of 300 seconds, attempt 1 of 3
slurmstepd: error: *** JOB 9602 ON gpu1-g5-525-dy-g5-8xlarge-1 CANCELLED AT 2024-10-02T10:14:33 ***
{'_id': '66f5789311144c1b98b9d196', 'instance_information': {'CUDA_version': '11.8', 'available_memory': '115.29GB', 'cpu_model': 'AMD EPYC 7R32', 'driver_version': '12.0', 'gpu_info': [{'id': 0, 'mem': 23642177536, 'name': 'NVIDIA A10G', 'pcie': '0000:00:1e'}], 'ofd_hard_limit': 131072, 'ofd_soft_limit': 8192, 'physical_cores': 16, 'platform_architecture': 'x86_64', 'platform_node': 'gpu1-g5-525-dy-g5-8xlarge-1', 'platform_release': '5.15.0-1031-aws', 'platform_version': '#35~20.04.1-Ubuntu SMP Sat Feb 11 16:19:06 UTC 2023', 'total_memory': '124.47GB', 'used_memory': '8.10GB'}, 'job_type': 'reference_motion_correction', 'killed_at': 'Wed, 02 Oct 2024 10:14:33 GMT', 'params_spec': {}, 'project_uid': 'P56', 'started_at': 'Tue, 01 Oct 2024 16:14:01 GMT', 'status': 'failed', 'uid': 'J432', 'version': 'v4.6.0'}

Thank you for the help,

William


@wjnicol I moved your post to its own topic because there is a chance that your issue has a cause distinct from the original topic.
Please can you check (and post) your GPU server instances' setting for transparent hugepages:

cat /sys/kernel/mm/transparent_hugepage/enabled 

You could run that command inside your cluster script, but you may want to direct the script's output and error streams to files other than the CryoSPARC job.log, for example:

#SBATCH --output={{ job_dir_abs }}/slurm.out
#SBATCH --error={{ job_dir_abs }}/slurm.err

Please can you also confirm in the specific GPU server instance’s system logs that no compute process (there may be compute processes in addition to main and monitor processes) was terminated by the kernel oom reaper.
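For example (generic Linux commands, assuming the node or a preserved copy of its kernel log is still accessible), oom killer activity would typically show up in

dmesg -T | grep -i -E "out of memory|oom"

or, where the systemd journal is available,

journalctl -k | grep -i oom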


Hello, thank you for assisting.

The output of the first command is:

always [madvise] never

What is the command you are referring to?

Please can you also confirm in the specific GPU server instance’s system logs that no compute process (there may be compute processes in addition to main and monitor processes) was terminated by the kernel oom reaper.

I don’t know how to do that.

You may want to test whether jobs continue to fail if this is set to
always madvise [never]
while a job is running. How to check? The command

cat /sys/kernel/mm/transparent_hugepage/enabled 

could be included in your CryoSPARC cluster script template just above the line

{{ run_cmd }}

to confirm the actual setting at the time the CryoSPARC job starts.
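For example, the relevant portion of the template might look like this (a sketch only; the SBATCH directives shown are based on the ones suggested above, and your actual template will differ):

#SBATCH --output={{ job_dir_abs }}/slurm.out
#SBATCH --error={{ job_dir_abs }}/slurm.err

# record the transparent hugepage setting in slurm.out just before the job starts
cat /sys/kernel/mm/transparent_hugepage/enabled

{{ run_cmd }}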

The information you posted earlier suggests that jobs run on ephemeral virtual machines in the cloud. Depending on your cloud settings, system logs may be preserved even after the termination of the virtual machine. If you can locate the system log relevant to the failed job, you could download and search inside the log file like this

grep -i oom /path/to/downloaded_logfile

Hi @wtempel, I’ve been troubleshooting this with @wjnicol. I tried following your suggestions, but we are still running into errors.

#!/bin/bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --output={{ job_dir_abs }}/slurm.out
#SBATCH --error={{ job_dir_abs }}/slurm.err
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task={{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --partition=gpu1-g5

sudo bash -c 'sync; echo 1 > /proc/sys/vm/drop_caches'
sudo bash -c "echo never > /sys/kernel/mm/transparent_hugepage/enabled"
sudo cat /sys/kernel/mm/transparent_hugepage/enabled
{{ run_cmd }}

The slurm.out file suggests that the transparent hugepage setting was properly changed, based on the following output. I also tried running a job with, and without, the drop_caches line in the script; it didn't seem to make a difference either way.

always madvise [never]

In case it is useful, here is the output from the cryosparc logfile before/after the first heartbeat error. We first observed this error after upgrading to v4.6.0. Any additional troubleshooting suggestions would be greatly appreciated! Happy to provide any additional details and/or logs that you might need.

========= sending heartbeat at 2024-10-07 16:24:36.959215
/shared/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:
kernel(735): warning #177-D: variable "zint" was declared but never referenced
warnings.warn(msg)
/shared/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:
kernel(35): warning #68-D: integer conversion resulted in a change of sign
kernel(44): warning #68-D: integer conversion resulted in a change of sign
kernel(17): warning #177-D: variable "N_I" was declared but never referenced
warnings.warn(msg)
/shared/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:
kernel(100): warning #177-D: variable "zint" was declared but never referenced
kernel(325): warning #177-D: variable "zint" was declared but never referenced
warnings.warn(msg)
/shared/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:
kernel(868): warning #177-D: variable "pos" was declared but never referenced
warnings.warn(msg)
========= sending heartbeat at 2024-10-07 16:24:46.973367
========= sending heartbeat at 2024-10-07 16:24:56.985353
========= sending heartbeat at 2024-10-07 16:25:07.001373
========= sending heartbeat at 2024-10-07 16:25:17.013361
========= sending heartbeat at 2024-10-07 16:25:27.025365
========= sending heartbeat at 2024-10-07 16:25:37.037345
========= sending heartbeat at 2024-10-07 16:25:47.049365
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2024-10-07 16:25:57.061355
========= sending heartbeat at 2024-10-07 16:26:07.073404
/shared/cryosparc/cryosparc_worker/cryosparc_compute/util/logsumexp.py:41: RuntimeWarning: divide by zero encountered in log
return n.log(wa * n.exp(a - vmax) + wb * n.exp(b - vmax) ) + vmax
/shared/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py:2304: RuntimeWarning: divide by zero encountered in log
run_old(*args, **kw)
========= sending heartbeat at 2024-10-07 16:26:17.085370
========= sending heartbeat at 2024-10-07 16:26:27.097359
========= sending heartbeat at 2024-10-07 16:26:37.109356
========= sending heartbeat at 2024-10-07 16:26:47.121361
========= sending heartbeat at 2024-10-07 16:26:57.133363
<string>:1: RuntimeWarning: invalid value encountered in true_divide
========= sending heartbeat at 2024-10-07 16:27:07.145365
========= sending heartbeat at 2024-10-07 16:27:17.160154
========= sending heartbeat at 2024-10-07 16:27:27.173374

Cheers,
Evan

@emgreen Are you referring to an error in the corresponding job’s event log? If so, please can you post the last few lines from that event log, including the error and some context leading up to the error.
The event log of a hypothetical job P99 J199 can be displayed with the command

cryosparcm eventlog P99 J199
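
For a long event log, the output can be piped through standard shell tools to show only the most recent entries, e.g.

cryosparcm eventlog P99 J199 | tail -n 50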

@wtempel Here are the last few lines from the event log. What I posted previously was from the J134/job.log file.

[Mon, 07 Oct 2024 16:52:37 GMT] [CPU RAM used: 3813 MB] Deleting last_output_file_path_abs: /fsx/cryoem-processing/project/CS-project-oct-2024/J134/J134_009_class_averages.mrc
[Mon, 07 Oct 2024 16:52:37 GMT] [CPU RAM used: 3813 MB] Removed output results for P57 J134
[Mon, 07 Oct 2024 16:52:38 GMT] [CPU RAM used: 3818 MB] Start of Iteration 11
[Mon, 07 Oct 2024 16:52:38 GMT] [CPU RAM used: 3818 MB] -- DEV 0 THR 0 NUM 20000 TOTAL 46.011991 ELAPSED 135.23532 --
[Mon, 07 Oct 2024 16:54:57 GMT] [CPU RAM used: 3871 MB] Finished engine iteration 11 in 138.911s
[Mon, 07 Oct 2024 16:54:57 GMT] [CPU RAM used: 3872 MB]   -- Effective number of classes per image: min 1.00 | 25-pct 8.61 | median 23.41 | 75-pct 50.06 | max 168.96
[Mon, 07 Oct 2024 16:54:57 GMT] [CPU RAM used: 3872 MB]   -- Probability of best class per image: min 0.00 | 25-pct 0.06 | median 0.12 | 75-pct 0.26 | max 1.00
[Mon, 07 Oct 2024 16:54:57 GMT] [CPU RAM used: 3872 MB] Solving 2D densities...
[Mon, 07 Oct 2024 16:54:57 GMT] [CPU RAM used: 3872 MB]     Solved class 200/200 in 0.21s
[Mon, 07 Oct 2024 16:55:07 GMT]  2D classes for iteration 11
[Mon, 07 Oct 2024 16:55:07 GMT]  Noise Model for iteration 11
[Mon, 07 Oct 2024 16:55:08 GMT]  Effective number of assigned classes for iteration 11
[Mon, 07 Oct 2024 16:55:08 GMT]  Probability of best class for iteration 11
[Mon, 07 Oct 2024 16:55:08 GMT] [CPU RAM used: 3992 MB] Done Full Iteration 11 took 150.288s for 80000 images
[Mon, 07 Oct 2024 16:55:10 GMT] [CPU RAM used: 4015 MB] Outputting results...
[Mon, 07 Oct 2024 16:55:11 GMT] [CPU RAM used: 4015 MB] Output particles to J134/J134_011_particles.cs
[Mon, 07 Oct 2024 16:55:11 GMT] [CPU RAM used: 4015 MB] Output class averages to J134/J134_011_class_averages.cs, J134/J134_011_class_averages.mrc
[Mon, 07 Oct 2024 16:55:11 GMT] [CPU RAM used: 4015 MB] Clearing previous iteration...
[Mon, 07 Oct 2024 16:55:11 GMT] [CPU RAM used: 4015 MB] Deleting last_output_file_path_abs: /fsx/cryoem-processing/project/CS-project-oct-2024/J134/J134_010_particles.cs
[Mon, 07 Oct 2024 16:55:11 GMT] [CPU RAM used: 4015 MB] Deleting last_output_file_path_abs: /fsx/cryoem-processing/project/CS-project-oct-2024/J134/J134_010_class_averages.cs
[Mon, 07 Oct 2024 16:55:11 GMT] [CPU RAM used: 4015 MB] Deleting last_output_file_path_abs: /fsx/cryoem-processing/project/CS-project-oct-2024/J134/J134_010_class_averages.mrc
[Mon, 07 Oct 2024 16:55:11 GMT] [CPU RAM used: 4015 MB] Removed output results for P57 J134
[Mon, 07 Oct 2024 16:55:12 GMT] [CPU RAM used: 4019 MB] Start of Iteration 12
[Mon, 07 Oct 2024 16:55:12 GMT] [CPU RAM used: 4019 MB] -- DEV 0 THR 1 NUM 20000 TOTAL 46.862222 ELAPSED 135.00626 --
[Mon, 07 Oct 2024 16:57:31 GMT] [CPU RAM used: 4070 MB] Finished engine iteration 12 in 138.906s
[Mon, 07 Oct 2024 16:57:31 GMT] [CPU RAM used: 4071 MB]   -- Effective number of classes per image: min 1.00 | 25-pct 6.11 | median 15.76 | 75-pct 34.41 | max 168.46
[Mon, 07 Oct 2024 16:57:31 GMT] [CPU RAM used: 4071 MB]   -- Probability of best class per image: min 0.00 | 25-pct 0.08 | median 0.16 | 75-pct 0.33 | max 1.00
[Mon, 07 Oct 2024 16:57:31 GMT] [CPU RAM used: 4071 MB] Solving 2D densities...
[Mon, 07 Oct 2024 16:57:31 GMT] [CPU RAM used: 4071 MB]     Solved class 200/200 in 0.21s
[Mon, 07 Oct 2024 16:57:41 GMT]  2D classes for iteration 12
[Mon, 07 Oct 2024 16:57:41 GMT]  Noise Model for iteration 12
[Mon, 07 Oct 2024 16:57:42 GMT]  Effective number of assigned classes for iteration 12
[Mon, 07 Oct 2024 16:57:42 GMT]  Probability of best class for iteration 12
[Mon, 07 Oct 2024 16:57:42 GMT] [CPU RAM used: 4194 MB] Done Full Iteration 12 took 150.326s for 80000 images
[Mon, 07 Oct 2024 16:57:44 GMT] [CPU RAM used: 4217 MB] Outputting results...
[Mon, 07 Oct 2024 16:57:45 GMT] [CPU RAM used: 4217 MB] Output particles to J134/J134_012_particles.cs
[Mon, 07 Oct 2024 16:57:45 GMT] [CPU RAM used: 4217 MB] Output class averages to J134/J134_012_class_averages.cs, J134/J134_012_class_averages.mrc
[Mon, 07 Oct 2024 16:57:45 GMT] [CPU RAM used: 4217 MB] Clearing previous iteration...
[Mon, 07 Oct 2024 16:57:45 GMT] [CPU RAM used: 4217 MB] Deleting last_output_file_path_abs: /fsx/cryoem-processing/project/CS-project-oct-2024/J134/J134_011_particles.cs
[Mon, 07 Oct 2024 16:57:45 GMT] [CPU RAM used: 4217 MB] Deleting last_output_file_path_abs: /fsx/cryoem-processing/project/CS-project-oct-2024/J134/J134_011_class_averages.cs
[Mon, 07 Oct 2024 16:57:45 GMT] [CPU RAM used: 4217 MB] Deleting last_output_file_path_abs: /fsx/cryoem-processing/project/CS-project-oct-2024/J134/J134_011_class_averages.mrc
[Mon, 07 Oct 2024 16:57:45 GMT] [CPU RAM used: 4217 MB] Removed output results for P57 J134
[Mon, 07 Oct 2024 16:57:46 GMT] [CPU RAM used: 4221 MB] Start of Iteration 13
[Mon, 07 Oct 2024 16:57:46 GMT] [CPU RAM used: 4221 MB] -- DEV 0 THR 0 NUM 16500 TOTAL 31.567989 ELAPSED 45.700442 --

Thanks @emgreen. Please can you confirm:

  1. Are these lines from the event log of J134?
  2. Is the job stuck at the event entry at [Mon, 07 Oct 2024 16:57:46 GMT]?
  3. What is the output of the command
    cryosparcm cli "get_job('P99', 'J134', 'job_type', 'version', 'instance_information', 'params_spec', 'status', 'started_at')"
    
  1. The previous excerpt was from the event log for J134.
  2. That is correct. The process hung there until I finally sent a kill command at [Mon, 07 Oct 2024 21:50:09 GMT].
  3. See below.
ubuntu@ip-10-1-101-39:~$ cryosparcm cli "get_job('P57', 'J134', 'job_type', 'version', 'instance_information', 'params_spec', 'status', 'started_at')"
{'_id': '670184785a550d32e2931083', 'instance_information': {'CUDA_version': '11.8', 'available_memory': '115.46GB', 'cpu_model': 'AMD EPYC 7R32', 'driver_version': '12.2', 'gpu_info': [{'id': 0, 'mem': 23609475072, 'name': 'NVIDIA A10G', 'pcie': '0000:00:1e'}], 'ofd_hard_limit': 131072, 'ofd_soft_limit': 8192, 'physical_cores': 16, 'platform_architecture': 'x86_64', 'platform_node': 'gpu1-g5-dy-g5-8xlarge-1', 'platform_release': '5.15.0-1062-aws', 'platform_version': '#68~20.04.1-Ubuntu SMP Wed May 1 15:24:09 UTC 2024', 'total_memory': '124.46GB', 'used_memory': '7.98GB'}, 'job_type': 'class_2D_new', 'params_spec': {'class2D_K': {'value': 200}, 'class2D_num_full_iter_batch': {'value': 40}, 'class2D_num_full_iter_batchsize_per_class': {'value': 400}}, 'project_uid': 'P57', 'started_at': 'Mon, 07 Oct 2024 16:24:16 GMT', 'status': 'killed', 'uid': 'J134', 'version': 'v4.6.0'}

Another piece of information that might be useful: I was able to clone the job, and it finished successfully later in the day (with a different random seed). This is J142; the associated get_job output is below. I wonder if the sporadic errors we observe are somehow caused by CryoSPARC jobs queueing on a pcluster compute node that is not freshly initialized at the time of job creation (e.g. a previous job finishes, the same on-demand node is reused for a subsequent job, but some resources are not properly freed up). That said, we do routinely run multiple jobs sequentially on the same pcluster compute node without errors.

ubuntu@ip-10-1-101-39:~$ cryosparcm cli "get_job('P57', 'J142', 'job_type', 'version', 'instance_information', 'params_spec', 'status', 'started_at')"
{'_id': '670429385a550d32e25ebbe4', 'instance_information': {'CUDA_version': '11.8', 'available_memory': '115.53GB', 'cpu_model': 'AMD EPYC 7R32', 'driver_version': '12.2', 'gpu_info': [{'id': 0, 'mem': 23609475072, 'name': 'NVIDIA A10G', 'pcie': '0000:00:1e'}], 'ofd_hard_limit': 131072, 'ofd_soft_limit': 8192, 'physical_cores': 16, 'platform_architecture': 'x86_64', 'platform_node': 'gpu1-g5-dy-g5-8xlarge-2', 'platform_release': '5.15.0-1062-aws', 'platform_version': '#68~20.04.1-Ubuntu SMP Wed May 1 15:24:09 UTC 2024', 'total_memory': '124.46GB', 'used_memory': '7.92GB'}, 'job_type': 'class_2D_new', 'params_spec': {'class2D_K': {'value': 200}, 'class2D_num_full_iter_batch': {'value': 40}, 'class2D_num_full_iter_batchsize_per_class': {'value': 400}}, 'project_uid': 'P57', 'started_at': 'Mon, 07 Oct 2024 18:32:33 GMT', 'status': 'completed', 'uid': 'J142', 'version': 'v4.6.0'}
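
One way we could test the "stale node" idea (just a sketch; free and nvidia-smi are standard tools, and slurm.out is the output file defined in the template above) would be to snapshot the node state right before each job starts by adding a few lines above {{ run_cmd }} in the cluster script:

# record node state before the CryoSPARC command runs
hostname
free -h
nvidia-smi
{{ run_cmd }}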

Thanks again for helping us troubleshoot this!
Evan

Perhaps related to my previous hypothesis (or just a red herring), I see the following line early in the job.log file for J134 but not for J142.

2024-10-07 16:24:24,423 cleanup_junk_files   INFO     | Removed 1 invalid item(s) in the cache