Queued: “Waiting because inputs are not ready”

Dear CS,

I have encountered the following issue a few times: after queuing a job, the job sits apparently forever in the following status:

queued
Waiting because inputs are not ready

At this point the job cannot be cleared or deleted.
This has happened during a signal subtraction job and in a 3D volume classification job.

As far as I know, all the initial inputs are OK.

Is this normal? Should I just wait until the “inputs are processed”? It seems to hang there with no messages for over half an hour.

Thanks

A job could legitimately be waiting for longer than half an hour if

  • the inputs were not ready when the job was queued and
  • upstream jobs in fact take a long time to produce the relevant outputs
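Conceptually, a queued job starts only once every job feeding its input slots has finished. A minimal sketch of that dependency check (the function and variable names here are illustrative, not CryoSPARC internals):

```python
# Illustrative sketch of the dependency check behind the
# "waiting because inputs are not ready" status (names are
# hypothetical, not CryoSPARC's internal API).

# Status of each upstream job, keyed by job UID (example data).
upstream_status = {
    "J1": "completed",  # parent job that finished normally
    "J2": "failed",     # parent job that never produced its outputs
}

def inputs_ready(parent_uids, status_by_uid):
    """A queued job can start only once every parent job has completed."""
    return all(status_by_uid.get(uid) == "completed" for uid in parent_uids)

print(inputs_ready(["J1"], upstream_status))        # True
print(inputs_ready(["J1", "J2"], upstream_status))  # False: waits indefinitely
```

The consequence: if any parent job fails (or is stuck), the downstream job never leaves the waiting state on its own.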

Please can you post

  • the CryoSPARC version
  • the project UID
  • a screenshot of the expanded Inputs section under the Inputs and Parameters tab showing the individual input slots
  • the job types of the jobs whose outputs were connected to the “waiting” job.

[edited for content and spelling]

CryoSPARC v4.5.3

Project UID: not sure where to find that number; the project is P7.

Inputs:

[screenshots of the expanded Inputs section]

The job types of the jobs whose outputs were connected to the “waiting” job: as far as I know, there are no pending jobs upstream.

The job does hang indefinitely (I left it overnight):
[screenshot]

I can’t kill or delete it unless I restart CryoSPARC, i.e.

cryosparcm stop

cryosparcm start

Not sure if this is part of the problem, but the input particles are from a symmetry expansion job. Running a particle subtraction job with the non-symmetry-expanded particle dataset does seem to work.

Thanks

I ran a test to check whether the problem was due to using the symmetry-expanded particles as input, but this does not seem to be the problem. I re-stacked the particles and gave them as input to the signal subtraction job, and the same problem arises:
Job status:
queued
waiting for inputs

I also tried a reduced-size symmetry-expanded particle dataset as input, to test whether the problem was due to too many particles (~5 million). The reduced symmetry-expanded set (order 2, approx. 700,000 particles output) as input to the signal subtraction job resulted in the same problem:
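For scale, symmetry expansion emits one copy of each particle per symmetry-related pose, so the output count is the input count multiplied by the expansion order. A back-of-envelope check of the numbers above (the 350,000 input figure is illustrative, inferred from ~700,000 / 2, not taken from the job logs):

```python
# Back-of-envelope: symmetry expansion multiplies the particle count
# by the expansion order. The 350,000 input figure is illustrative,
# inferred from the ~700,000-particle output at order 2.
def expanded_count(n_particles, order):
    return n_particles * order

print(expanded_count(350_000, 2))  # 700000
```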
Job status:
queued
waiting for inputs

As I said, the job stays in this status longer than any other job performed with the same dataset (at least overnight), and with no indication of what these ‘inputs’ it is waiting for may be.

Has the signal subtraction job been tested with helical and helical symmetry-expanded particles? Are there any examples?

Thanks

Hernando

Thanks @hsosa for these details. It may not be possible to delete or kill a job in the waiting state. Please can you show the actions available in the menu of the job card for J49 when J49 is waiting for inputs, similar to this example:


Are you able to clear the job?
Please can you also post the outputs of these commands

for job in J31 J36 J44 J49
do
  cryosparcm cli "get_job('P7', '$job', 'job_type', 'version', 'status', 'queued_at', 'started_at', 'completed_at')"
done

Yes, I could clear the job.

for job in J31 J36 J44 J49
do
  cryosparcm cli "get_job('P7', '$job', 'job_type', 'version', 'status', 'queued_at', 'started_at', 'completed_at')"
done

{'_id': '66b50c2159deb4dbc7da829a', 'completed_at': 'Thu, 08 Aug 2024 18:25:03 GMT', 'job_type': 'sym_expand', 'project_uid': 'P7', 'queued_at': 'Thu, 08 Aug 2024 18:22:16 GMT', 'started_at': 'Thu, 08 Aug 2024 18:22:19 GMT', 'status': 'completed', 'uid': 'J31', 'version': 'v4.5.3'}
{'_id': '66b5290759deb4dbc7dbb44f', 'completed_at': None, 'job_type': 'new_local_refine', 'project_uid': 'P7', 'queued_at': 'Thu, 08 Aug 2024 20:25:16 GMT', 'started_at': 'Thu, 08 Aug 2024 20:25:19 GMT', 'status': 'failed', 'uid': 'J36', 'version': 'v4.5.3'}
{'_id': '66ba3e5538f5e07249e6b15e', 'completed_at': 'Mon, 12 Aug 2024 17:00:00 GMT', 'job_type': 'volume_tools', 'project_uid': 'P7', 'queued_at': 'Mon, 12 Aug 2024 16:59:05 GMT', 'started_at': 'Mon, 12 Aug 2024 16:59:08 GMT', 'status': 'completed', 'uid': 'J44', 'version': 'v4.5.3'}
{'_id': '66bb6dbbd789a3c1957fe1c3', 'completed_at': None, 'job_type': 'particle_subtract', 'project_uid': 'P7', 'queued_at': None, 'started_at': None, 'status': 'building', 'uid': 'J49', 'version': 'v4.5.3'}
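The records above already point at the likely culprit: any upstream job whose status never reaches completed will hold a downstream job in the waiting state indefinitely. A small sketch that filters the pasted records for such jobs:

```python
# Filter the get_job records pasted above for jobs that never completed;
# any of these upstream of the waiting job would keep it waiting forever.
jobs = [
    {"uid": "J31", "job_type": "sym_expand",        "status": "completed"},
    {"uid": "J36", "job_type": "new_local_refine",  "status": "failed"},
    {"uid": "J44", "job_type": "volume_tools",      "status": "completed"},
    {"uid": "J49", "job_type": "particle_subtract", "status": "building"},
]

not_completed = [j["uid"] for j in jobs if j["status"] != "completed"]
print(not_completed)  # ['J36', 'J49']
```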

I realized that J36, a local refinement job, failed to finish. It says:
Job is unresponsive - no heartbeat received in 180 seconds.

I guess I was using one of the intermediate volumes output by this unfinished job as an input to the signal subtraction job, so maybe that job was waiting for the failed job to finish?
I don’t know, though, why the local refinement job failed. It was also using the symmetry-expanded particles as input.

Thanks

That would be a plausible cause: J49 continued waiting for outputs that J36 never produced.

Please can you post the outputs of these commands:

cryosparcm joblog P7 J36 | tail -n 20
cryosparcm eventlog P7 J36 | tail -n 40
cryosparcm cli "get_job('P7', 'J36', 'instance_information', 'failed_at')"

cryosparcm joblog P7 J36 | tail -n 20
========= sending heartbeat at 2024-08-09 17:26:29.411399
========= sending heartbeat at 2024-08-09 17:26:39.431868
========= sending heartbeat at 2024-08-09 17:26:49.453168
========= sending heartbeat at 2024-08-09 17:26:59.476487
========= sending heartbeat at 2024-08-09 17:27:09.497618
========= sending heartbeat at 2024-08-09 17:27:19.517979
========= sending heartbeat at 2024-08-09 17:27:29.542362
========= sending heartbeat at 2024-08-09 17:27:39.562882
========= sending heartbeat at 2024-08-09 17:27:49.583190
========= sending heartbeat at 2024-08-09 17:27:59.603462
========= sending heartbeat at 2024-08-09 17:28:09.624962
========= sending heartbeat at 2024-08-09 17:28:19.646636
========= sending heartbeat at 2024-08-09 17:28:29.667578
========= sending heartbeat at 2024-08-09 17:28:39.687909
========= sending heartbeat at 2024-08-09 17:28:49.709479
========= sending heartbeat at 2024-08-09 17:28:59.730702
========= sending heartbeat at 2024-08-09 17:29:09.750923
========= sending heartbeat at 2024-08-09 17:29:19.771319
========= sending heartbeat at 2024-08-09 17:29:29.791653
/home/cryosparc_user/software/cryosparc/cryosparc_worker/bin/cryosparcw: line 150: 44727 Quit (core dumped) python -c "import cryosparc_compute.run as run; run.run()" "$@"

cryosparcm eventlog P7 J36 | tail -n 40
[Fri, 09 Aug 2024 15:41:08 GMT] [CPU RAM used: 69741 MB] Computing FFTs on GPU.
[Fri, 09 Aug 2024 15:41:13 GMT] [CPU RAM used: 69746 MB] Done in 4.655s
[Fri, 09 Aug 2024 15:41:13 GMT] [CPU RAM used: 69746 MB] Computing cFSCs…
[Fri, 09 Aug 2024 15:41:22 GMT] [CPU RAM used: 69746 MB] Done in 9.121s
[Fri, 09 Aug 2024 15:41:22 GMT] [CPU RAM used: 69746 MB] Using Filter Radius 219.052 (2.966A) | Previous: 216.173 (3.006A)
[Fri, 09 Aug 2024 15:41:43 GMT] [CPU RAM used: 64786 MB] Non-uniform regularization with compute option: GPU
[Fri, 09 Aug 2024 15:41:43 GMT] [CPU RAM used: 64786 MB] Running local cross validation for A …
[Fri, 09 Aug 2024 15:43:17 GMT] [CPU RAM used: 66439 MB] Local cross validation A done in 93.464s
[Fri, 09 Aug 2024 15:43:18 GMT] FSC Filtered Side A
[Fri, 09 Aug 2024 15:43:18 GMT] CV Filtered Side A
[Fri, 09 Aug 2024 15:43:18 GMT] [CPU RAM used: 66439 MB] Running local cross validation for B …
[Fri, 09 Aug 2024 15:44:42 GMT] [CPU RAM used: 68089 MB] Local cross validation B done in 84.090s
[Fri, 09 Aug 2024 15:44:44 GMT] FSC Filtered Side B
[Fri, 09 Aug 2024 15:44:44 GMT] CV Filtered Side B
[Fri, 09 Aug 2024 15:45:21 GMT] [CPU RAM used: 68269 MB] Estimated Bfactor: -133.4
[Fri, 09 Aug 2024 15:45:21 GMT] [CPU RAM used: 68269 MB] Plotting…
[Fri, 09 Aug 2024 15:45:39 GMT] Real Space Slices Iteration 003
[Fri, 09 Aug 2024 15:45:41 GMT] Fourier Space Slices Iteration 003
[Fri, 09 Aug 2024 15:45:43 GMT] Real Space Mask Slices Iteration 003
[Fri, 09 Aug 2024 15:45:43 GMT] FSC Iteration 003
[Fri, 09 Aug 2024 15:45:43 GMT] cFSCs (Half-angle: 20°) Iteration 003, with tight mask
[Fri, 09 Aug 2024 15:45:57 GMT] Guinier Plot Iteration 003
[Fri, 09 Aug 2024 15:45:57 GMT] Noise Model Iteration 003
[Fri, 09 Aug 2024 15:46:06 GMT] Viewing Direction Distribution Iteration 003
[Fri, 09 Aug 2024 15:46:07 GMT] Posterior Precision Directional Distribution Iteration 003
[Fri, 09 Aug 2024 15:46:45 GMT] Magnitudes of alignment changes Iteration 003
[Fri, 09 Aug 2024 15:46:57 GMT] Per particle scale factors 003
[Fri, 09 Aug 2024 15:46:57 GMT] [CPU RAM used: 68607 MB] Done in 95.369s.
[Fri, 09 Aug 2024 15:46:57 GMT] [CPU RAM used: 68607 MB] Outputting files…
[Fri, 09 Aug 2024 15:47:39 GMT] [CPU RAM used: 67905 MB] Done in 42.525s.
[Fri, 09 Aug 2024 15:47:39 GMT] [CPU RAM used: 67905 MB] Done iteration 3 in 21905.059s. Total time so far 69735.381s
[Fri, 09 Aug 2024 15:47:40 GMT] [CPU RAM used: 67905 MB] ----------------------------- Start Iteration 4
[Fri, 09 Aug 2024 15:47:40 GMT] [CPU RAM used: 67905 MB] Using Max Alignment Radius 219.052 (2.966A)
[Fri, 09 Aug 2024 15:47:40 GMT] [CPU RAM used: 67905 MB] Using Full Dataset (split 2747535 in A, 2747535 in B)
[Fri, 09 Aug 2024 15:47:50 GMT] [CPU RAM used: 67779 MB] Current alpha values ( 1.00 | 1.00 | 1.00 | 1.00 | 1.00 )
[Fri, 09 Aug 2024 15:47:50 GMT] [CPU RAM used: 67779 MB] – THR 0 BATCH 500 NUM 559000 TOTAL 2269.6214 ELAPSED 20505.030 –
[Fri, 09 Aug 2024 15:47:50 GMT] Alignment map A
[Fri, 09 Aug 2024 15:47:51 GMT] Alignment map B
[Fri, 09 Aug 2024 21:32:29 GMT] **** Kill signal sent by CryoSPARC (ID: ) ****
[Fri, 09 Aug 2024 21:33:27 GMT] Job is unresponsive - no heartbeat received in 180 seconds.

cryosparcm cli "get_job('P7', 'J36', 'instance_information', 'failed_at')"
{'_id': '66b5290759deb4dbc7dbb44f', 'failed_at': 'Fri, 09 Aug 2024 21:33:04 GMT', 'instance_information': {'CUDA_version': '11.8', 'available_memory': '494.45GB', 'cpu_model': 'Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz', 'driver_version': '12.2', 'gpu_info': [{'id': 0, 'mem': 51038388224, 'name': 'NVIDIA RTX A6000', 'pcie': '0000:31:00'}, {'id': 1, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000', 'pcie': '0000:4b:00'}, {'id': 2, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000', 'pcie': '0000:b1:00'}, {'id': 3, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000', 'pcie': '0000:ca:00'}], 'ofd_hard_limit': 1048576, 'ofd_soft_limit': 1024, 'physical_cores': 32, 'platform_architecture': 'x86_64', 'platform_node': 'lomaredonda', 'platform_release': '5.15.0-91-generic', 'platform_version': '#101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023', 'total_memory': '503.49GB', 'used_memory': '4.52GB'}, 'project_uid': 'P7', 'uid': 'J36'}

@hsosa Our best guess is that the worker computer on which the local refine job ran may have run out of system RAM and started swapping, or stalled for some other reason.