Hello,
I am hopping in on this troubleshooting page because we are experiencing the same issue. It seems to have started after the latest update. It happens on a variety of jobs, but usually on long-running ones such as NU refine, 2D classification, or Reference-based Motion Correction. Below are the outputs you asked for, from a Reference-based Motion Correction job that failed yesterday:
[Tue, 01 Oct 2024 16:09:40 GMT] License is valid.
[Tue, 01 Oct 2024 16:09:40 GMT] Launching job on lane gpu1-g5-525 target gpu1-g5-525 ...
[Tue, 01 Oct 2024 16:09:40 GMT] Launching job on cluster gpu1-g5-525
[Tue, 01 Oct 2024 16:09:40 GMT]
====================== Cluster submission script: ========================
==========================================================================
#!/bin/bash
#SBATCH --job-name=cryosparc_P56_J432
#SBATCH --output=/fsx/cryoem-processing/XXXX/CS-XXXX-sept/J432/job.log
#SBATCH --error=/fsx/cryoem-processing/XXXX/CS-XXXX-sept/J432/job.log
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu1-g5-525
sudo sed -i '1s/^/10.99.168.213 ip-10-99-168-213\n /' /etc/hosts
/shared/cryosparc/cryosparc_worker/bin/cryosparcw run --project P56 --job J432 --master_hostname 10.99.168.213 --master_command_core_port 45002 > /fsx/cryoem-processing/XXXX/CS-XXXX-sept/J432/job.log 2>&1
==========================================================================
==========================================================================
[Tue, 01 Oct 2024 16:09:40 GMT] -------- Submission command:
sbatch /fsx/cryoem-processing/XXXX/CS-XXXX-sept/J432/queue_sub_script.sh
[Tue, 01 Oct 2024 16:09:40 GMT] -------- Cluster Job ID:
9602
[Tue, 01 Oct 2024 16:09:40 GMT] -------- Queued on cluster at 2024-10-01 16:09:40.689028
[Tue, 01 Oct 2024 16:09:41 GMT] -------- Cluster job status at 2024-10-01 16:13:54.826631 (25 retries)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
9602 gpu1-g5-5 cryospar ubuntu R 0:04 1 gpu1-g5-525-dy-g5-8xlarge-1
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 90 MB] Job J432 Started
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 90 MB] Master running v4.6.0, worker running v4.6.0
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 91 MB] Working in directory: /fsx/cryoem-processing/XXXX/CS-XXXX-sept/J432
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 91 MB] Running on lane gpu1-g5-525
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 91 MB] Resources allocated:
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 91 MB] Worker: gpu1-g5-525
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 91 MB] CPU : [0, 1, 2, 3, 4, 5]
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 91 MB] GPU : [0]
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 91 MB] RAM : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 91 MB] SSD : False
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 91 MB] --------------------------------------------------------------
[Tue, 01 Oct 2024 16:14:01 GMT] [CPU RAM used: 91 MB] Importing job module for job type reference_motion_correction...
S9/import_movies/FoilHole_21559105_Data_21557734_0_20240920_181231_EER.eer
[Tue, 01 Oct 2024 23:46:41 GMT] [CPU RAM used: 6947 MB] Plotting trajectories and particles for movie 8604535981330866848
S9/import_movies/FoilHole_21559105_Data_21557737_0_20240920_181239_EER.eer
[Tue, 01 Oct 2024 23:46:45 GMT] [CPU RAM used: 7175 MB] Plotting trajectories and particles for movie 18252624766583027495
S9/import_movies/FoilHole_21559105_Data_21557740_0_20240920_181250_EER.eer
[Tue, 01 Oct 2024 23:47:16 GMT] [CPU RAM used: 16764 MB] Plotting trajectories and particles for movie 9936234382313998456
S9/import_movies/FoilHole_21559115_Data_21557734_0_20240920_181409_EER.eer
[Tue, 01 Oct 2024 23:47:21 GMT] [CPU RAM used: 7735 MB] Plotting trajectories and particles for movie 16132868517269104694
S9/import_movies/FoilHole_21559115_Data_21557737_0_20240920_181418_EER.eer
[Tue, 01 Oct 2024 23:47:49 GMT] [CPU RAM used: 15790 MB] Plotting trajectories and particles for movie 2546108472353545219
S9/import_movies/FoilHole_21559234_Data_21557734_39_20240920_181434_EER.eer
[Tue, 01 Oct 2024 23:47:54 GMT] [CPU RAM used: 8698 MB] Plotting trajectories and particles for movie 3156738655360191867
S9/import_movies/FoilHole_21559115_Data_21557740_0_20240920_181426_EER.eer
[Tue, 01 Oct 2024 23:48:31 GMT] [CPU RAM used: 7524 MB] Plotting trajectories and particles for movie 734182889653974394
S9/import_movies/FoilHole_21559234_Data_21557737_39_20240920_181442_EER.eer
[Tue, 01 Oct 2024 23:48:35 GMT] [CPU RAM used: 12193 MB] Plotting trajectories and particles for movie 3129815169519156607
S9/import_movies/FoilHole_21559234_Data_21557740_39_20240920_181451_EER.eer
[Tue, 01 Oct 2024 23:48:38 GMT] [CPU RAM used: 16554 MB] No further example plots will be made, but the job is still running (see progress bar above).
[Wed, 02 Oct 2024 10:14:33 GMT] **** Kill signal sent by CryoSPARC (ID: <Heartbeat Monitor>) ****
[Wed, 02 Oct 2024 10:18:24 GMT] Job is unresponsive - no heartbeat received in 240 seconds.
================= CRYOSPARCW ======= 2024-10-01 16:13:57.560391 =========
Project P56 Job J432
Master 10.99.168.213 Port 45002
===========================================================================
MAIN PROCESS PID 8222
========= now starting main process at 2024-10-01 16:13:57.560919
motioncorrection.run_reference_motion cryosparc_compute.jobs.jobregister
/shared/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
setattr(self, word, getattr(machar, word).flat[0])
/shared/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
return self._float_to_str(self.smallest_subnormal)
/shared/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
setattr(self, word, getattr(machar, word).flat[0])
/shared/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
return self._float_to_str(self.smallest_subnormal)
MONITOR PROCESS PID 8224
========= sending heartbeat at 2024-10-02 10:04:22.243328
========= sending heartbeat at 2024-10-02 10:04:32.257703
========= sending heartbeat at 2024-10-02 10:04:42.272811
========= sending heartbeat at 2024-10-02 10:04:52.287329
========= sending heartbeat at 2024-10-02 10:05:02.301768
========= sending heartbeat at 2024-10-02 10:05:12.316043
========= sending heartbeat at 2024-10-02 10:05:22.330617
========= sending heartbeat at 2024-10-02 10:05:32.345355
========= sending heartbeat at 2024-10-02 10:05:42.360251
========= sending heartbeat at 2024-10-02 10:05:52.374642
========= sending heartbeat at 2024-10-02 10:06:02.388936
========= sending heartbeat at 2024-10-02 10:06:12.403730
========= sending heartbeat at 2024-10-02 10:06:22.418664
========= sending heartbeat at 2024-10-02 10:06:32.433591
========= sending heartbeat at 2024-10-02 10:06:42.448117
========= sending heartbeat at 2024-10-02 10:06:52.462729
========= sending heartbeat at 2024-10-02 10:07:02.477154
========= sending heartbeat at 2024-10-02 10:07:12.488272
========= sending heartbeat at 2024-10-02 10:07:22.502594
========= sending heartbeat at 2024-10-02 10:07:32.516948
========= sending heartbeat at 2024-10-02 10:07:42.531177
========= sending heartbeat at 2024-10-02 10:07:52.545868
========= sending heartbeat at 2024-10-02 10:08:02.560649
========= sending heartbeat at 2024-10-02 10:08:12.575232
========= sending heartbeat at 2024-10-02 10:08:22.590055
========= sending heartbeat at 2024-10-02 10:08:32.604408
========= sending heartbeat at 2024-10-02 10:08:42.618858
========= sending heartbeat at 2024-10-02 10:08:52.633336
========= sending heartbeat at 2024-10-02 10:09:02.647757
========= sending heartbeat at 2024-10-02 10:09:12.662022
========= sending heartbeat at 2024-10-02 10:09:22.681091
========= sending heartbeat at 2024-10-02 10:09:32.695557
========= sending heartbeat at 2024-10-02 10:09:42.710385
========= sending heartbeat at 2024-10-02 10:09:52.724873
========= sending heartbeat at 2024-10-02 10:10:02.739192
========= sending heartbeat at 2024-10-02 10:10:12.753804
========= sending heartbeat at 2024-10-02 10:10:22.768007
========= sending heartbeat at 2024-10-02 10:10:32.780593
<string>:1: UserWarning: *** CommandClient: command (http://10.99.168.213:45002/api) did not reply within timeout of 300 seconds, attempt 1 of 3
slurmstepd: error: *** JOB 9602 ON gpu1-g5-525-dy-g5-8xlarge-1 CANCELLED AT 2024-10-02T10:14:33 ***
{'_id': '66f5789311144c1b98b9d196', 'instance_information': {'CUDA_version': '11.8', 'available_memory': '115.29GB', 'cpu_model': 'AMD EPYC 7R32', 'driver_version': '12.0', 'gpu_info': [{'id': 0, 'mem': 23642177536, 'name': 'NVIDIA A10G', 'pcie': '0000:00:1e'}], 'ofd_hard_limit': 131072, 'ofd_soft_limit': 8192, 'physical_cores': 16, 'platform_architecture': 'x86_64', 'platform_node': 'gpu1-g5-525-dy-g5-8xlarge-1', 'platform_release': '5.15.0-1031-aws', 'platform_version': '#35~20.04.1-Ubuntu SMP Sat Feb 11 16:19:06 UTC 2023', 'total_memory': '124.47GB', 'used_memory': '8.10GB'}, 'job_type': 'reference_motion_correction', 'killed_at': 'Wed, 02 Oct 2024 10:14:33 GMT', 'params_spec': {}, 'project_uid': 'P56', 'started_at': 'Tue, 01 Oct 2024 16:14:01 GMT', 'status': 'failed', 'uid': 'J432', 'version': 'v4.6.0'}
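In case it helps anyone comparing logs, here is a minimal Python sketch of how the "sending heartbeat at" lines in job.log could be checked for gaps. The function names `heartbeat_times` and `heartbeat_gaps` and the 30-second threshold are my own assumptions for illustration; they are not part of CryoSPARC.

```python
import re
from datetime import datetime

# Matches the timestamp in lines like:
# "========= sending heartbeat at 2024-10-02 10:10:32.780593"
HEARTBEAT_RE = re.compile(
    r"sending heartbeat at (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)"
)

def heartbeat_times(lines):
    """Return the heartbeat timestamps found in the given log lines."""
    times = []
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    return times

def heartbeat_gaps(times, threshold_s=30.0):
    """Return (previous, current, seconds) for consecutive heartbeats
    that are more than threshold_s apart (threshold is an assumed value)."""
    gaps = []
    for prev, cur in zip(times, times[1:]):
        delta = (cur - prev).total_seconds()
        if delta > threshold_s:
            gaps.append((prev, cur, delta))
    return gaps

# The last three heartbeat lines from the job.log excerpt above.
sample = [
    "========= sending heartbeat at 2024-10-02 10:10:12.753804",
    "========= sending heartbeat at 2024-10-02 10:10:22.768007",
    "========= sending heartbeat at 2024-10-02 10:10:32.780593",
]
times = heartbeat_times(sample)
print(len(times), heartbeat_gaps(times))
```

On the excerpt above this reports no gaps: the worker logged a heartbeat roughly every 10 seconds up to 10:10:32, and the kill signal at 10:14:33 came about 240 seconds after that last one. That might suggest the stall was in the master replying rather than in the worker sending, but I cannot confirm that from these logs alone.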
Thank you for the help,
William