Running Topaz jobs on AWS?

Hello,

We have recently been building out additional compute capacity on AWS. However, all of our Topaz Train jobs have died with the “No Heartbeat” error between 5 and 70 minutes into the job. We have not observed this when running on local machines. Could anyone suggest job parameters and node types that work in the cloud environment?

Many thanks!

Welcome to the forum @andreymgtk .
For one failed Topaz Train job, could you please post the outputs of these commands, run on the CryoSPARC master host:

csprojectid=P99 # replace with actual project ID
csjobid=J199 # replace with actual job ID of failed train job
cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version',  'params_spec')"
cryosparcm joblog $csprojectid $csjobid | head -n 40
cryosparcm joblog $csprojectid $csjobid | tail -n 40
cryosparcm cli "get_scheduler_targets()"

Hello wtempel. @andreymgtk and I are working on the same issue. The output of the Topaz Train job is as follows.

[CPU: 318.4 MB]
Starting Topaz process using version 0.2.5a…

[CPU: 318.4 MB]
Random seed used is 777661043

[CPU: 318.4 MB]

[CPU: 318.4 MB]
Starting preprocessing…

[CPU: 318.4 MB]
Starting micrograph preprocessing by running command /efshome/miniconda3/envs/topaz/bin/topaz preprocess --scale 8 --niters 200 --num-workers 32 -o /projects/CSProjects/Andrey/CS-integrin/J161/preprocessed [6115 MICROGRAPH PATHS EXCLUDED FOR LEGIBILITY]

[CPU: 318.4 MB]
Preprocessing over 32 processes…

**** Kill signal sent by CryoSPARC (ID: ) ****

Job is unresponsive - no heartbeat received in 600 seconds.

And the job log is as follows…

================= CRYOSPARCW ======= 2025-01-23 17:35:58.776387 =========
Project P2 Job J161
Master 10.158.37.67 Port 39002

========= monitor process now starting main process at 2025-01-23 17:35:58.776422
MAINPROCESS PID 45811
========= monitor process now waiting for main process
MAIN PID 45811
topaz.run_topaz cryosparc_compute.jobs.jobregister
========= sending heartbeat at 2025-01-23 17:36:13.193720
========= sending heartbeat at 2025-01-23 17:36:23.210135
========= sending heartbeat at 2025-01-23 17:36:33.225750
========= sending heartbeat at 2025-01-23 17:36:43.240854
========= sending heartbeat at 2025-01-23 17:36:53.255998
========= sending heartbeat at 2025-01-23 17:37:03.271178
========= sending heartbeat at 2025-01-23 17:37:13.286328
========= sending heartbeat at 2025-01-23 17:37:23.302037
========= sending heartbeat at 2025-01-23 17:37:33.317274
========= sending heartbeat at 2025-01-23 17:37:43.332407
========= sending heartbeat at 2025-01-23 17:37:53.344216
========= sending heartbeat at 2025-01-23 17:38:03.356219
========= sending heartbeat at 2025-01-23 17:38:13.369281
========= sending heartbeat at 2025-01-23 17:38:23.384291
========= sending heartbeat at 2025-01-23 17:38:33.401291
========= sending heartbeat at 2025-01-23 17:38:43.872461
========= sending heartbeat at 2025-01-23 17:38:53.896924
========= sending heartbeat at 2025-01-23 17:39:03.916268
========= sending heartbeat at 2025-01-23 17:39:16.056222
========= sending heartbeat at 2025-01-23 17:39:26.080298
========= sending heartbeat at 2025-01-23 17:39:37.972239
========= sending heartbeat at 2025-01-23 17:39:48.200506
========= sending heartbeat at 2025-01-23 17:39:58.264655
========= sending heartbeat at 2025-01-23 17:40:08.402368
========= sending heartbeat at 2025-01-23 17:40:18.472539
========= sending heartbeat at 2025-01-23 17:40:28.504278
========= sending heartbeat at 2025-01-23 17:42:21.612915
========= sending heartbeat at 2025-01-23 17:42:31.668656
========= sending heartbeat at 2025-01-23 17:42:41.683917
========= sending heartbeat at 2025-01-23 17:42:51.699323
========= sending heartbeat at 2025-01-23 17:43:51.493251
========= sending heartbeat at 2025-01-23 17:44:47.081297
========= sending heartbeat at 2025-01-23 17:45:25.458730
========= sending heartbeat at 2025-01-23 17:45:44.907995
========= sending heartbeat at 2025-01-23 17:47:17.548987
========= sending heartbeat at 2025-01-23 17:48:14.464404
========= sending heartbeat at 2025-01-23 17:50:12.652470
========= sending heartbeat at 2025-01-23 17:51:00.684819
========= sending heartbeat at 2025-01-23 17:56:04.663372
========= sending heartbeat at 2025-01-23 17:56:28.202132
========= sending heartbeat at 2025-01-23 17:57:08.648210
========= sending heartbeat at 2025-01-23 17:57:18.666469
========= sending heartbeat at 2025-01-23 17:58:57.710242
========= sending heartbeat at 2025-01-23 18:03:42.037735
========= sending heartbeat at 2025-01-23 18:04:10.148783
========= sending heartbeat at 2025-01-23 18:08:27.783292
========= sending heartbeat at 2025-01-23 18:08:51.780298
========= sending heartbeat at 2025-01-23 18:10:57.766080
========= sending heartbeat at 2025-01-23 18:11:26.326942
========= sending heartbeat at 2025-01-23 18:12:46.633101
========= sending heartbeat at 2025-01-23 18:13:04.595961
========= sending heartbeat at 2025-01-23 18:13:32.497638
========= sending heartbeat at 2025-01-23 18:13:55.884226
========= sending heartbeat at 2025-01-23 18:14:07.190575
========= sending heartbeat at 2025-01-23 18:14:21.323937
========= sending heartbeat at 2025-01-23 18:14:46.330298
========= sending heartbeat at 2025-01-23 18:15:51.983331
========= sending heartbeat at 2025-01-23 18:16:06.842476
========= sending heartbeat at 2025-01-23 18:18:44.795969
========= sending heartbeat at 2025-01-23 18:19:56.223651
========= sending heartbeat at 2025-01-23 18:20:25.691280
========= sending heartbeat at 2025-01-23 18:20:47.254840
========= sending heartbeat at 2025-01-23 18:21:18.299501
========= sending heartbeat at 2025-01-23 18:21:42.210431
========= sending heartbeat at 2025-01-23 18:22:05.621251
========= sending heartbeat at 2025-01-23 18:22:45.675427
========= sending heartbeat at 2025-01-23 18:23:17.656941
========= sending heartbeat at 2025-01-23 18:23:39.645579
========= sending heartbeat at 2025-01-23 18:24:29.864808
========= sending heartbeat at 2025-01-23 18:25:13.770655
========= sending heartbeat at 2025-01-23 18:25:37.749438
========= sending heartbeat at 2025-01-23 18:26:49.719516
========= sending heartbeat at 2025-01-23 18:27:17.566423
========= sending heartbeat at 2025-01-23 18:27:34.325382
========= sending heartbeat at 2025-01-23 18:27:47.804993
========= sending heartbeat at 2025-01-23 18:28:10.839865
========= sending heartbeat at 2025-01-23 18:28:28.957608
========= sending heartbeat at 2025-01-23 18:28:52.541983
========= sending heartbeat at 2025-01-23 18:29:29.427396
========= sending heartbeat at 2025-01-23 18:30:22.849537
========= sending heartbeat at 2025-01-23 18:31:00.528217
========= sending heartbeat at 2025-01-23 18:31:25.544770
========= sending heartbeat at 2025-01-23 18:33:06.010471
========= sending heartbeat at 2025-01-23 18:33:22.710507
========= sending heartbeat at 2025-01-23 18:33:57.988032
========= sending heartbeat at 2025-01-23 18:34:22.648470
========= sending heartbeat at 2025-01-23 18:34:57.582386
========= sending heartbeat at 2025-01-23 18:35:23.710274
========= sending heartbeat at 2025-01-23 18:36:16.934326
========= sending heartbeat at 2025-01-23 18:37:40.986419
========= sending heartbeat at 2025-01-23 18:37:51.200402
========= sending heartbeat at 2025-01-23 18:38:01.264493
========= sending heartbeat at 2025-01-23 18:38:11.280350
========= sending heartbeat at 2025-01-23 18:38:21.296515
========= sending heartbeat at 2025-01-23 18:38:31.332468
========= sending heartbeat at 2025-01-23 18:38:41.348349
========= sending heartbeat at 2025-01-23 18:38:51.364520
========= sending heartbeat at 2025-01-23 18:39:01.380343
========= sending heartbeat at 2025-01-23 18:39:11.396502
========= sending heartbeat at 2025-01-23 18:39:21.422686
========= sending heartbeat at 2025-01-23 18:39:31.515494
========= sending heartbeat at 2025-01-23 18:39:41.892335
========= sending heartbeat at 2025-01-23 18:39:52.056220
========= sending heartbeat at 2025-01-23 18:40:02.856226
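
In the log above, the heartbeat interval drifts from roughly 10 seconds to as much as about 5 minutes before the job is killed. To list such stalls without reading every line, a small filter along these lines may help. It is only a sketch: the function name `heartbeat_gaps` and the 30-second default threshold are made up for illustration, and it assumes GNU `date` and the exact "sending heartbeat at" line format shown above.

```shell
# Read a job log on stdin and report every pause between consecutive
# heartbeats that is longer than the threshold (in seconds, default 30).
# Assumes GNU date; timestamps are treated as UTC, since only differences matter.
heartbeat_gaps() {
    threshold=${1:-30}
    prev=""
    grep 'sending heartbeat at' | while read -r f1 f2 f3 f4 day time; do
        # Drop the fractional seconds before converting to epoch seconds.
        now=$(TZ=UTC date -d "$day ${time%.*}" +%s) || continue
        if [ -n "$prev" ] && [ $((now - prev)) -gt "$threshold" ]; then
            echo "$((now - prev))s gap before $day $time"
        fi
        prev=$now
    done
}
```

For example, `heartbeat_gaps 60 < job.log`, where `job.log` stands in for a saved copy of the job log, would print only the pauses longer than one minute.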

@Surya @andreymgtk You may want to check that

  1. transparent huge pages (THP) are disabled ([never]) on the CryoSPARC worker (how?)
  2. a lower Number of CPUs (4?) is specified for the job
  3. a low Number of parallel processes (2?) is specified for the job
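
For the first item, a quick check on the worker node could look like the sketch below. It assumes a Linux worker that exposes the usual sysfs file `/sys/kernel/mm/transparent_hugepage/enabled`, in which the active mode is the word in square brackets. The helper name `thp_active_mode` is made up for illustration.

```shell
# Print the active THP mode, i.e. the bracketed word in a line such as
# "always madvise [never]".
thp_active_mode() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Assumed location of the THP setting on a Linux worker.
thp_file=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp_file" ]; then
    echo "THP mode on $(uname -n): $(thp_active_mode < "$thp_file")"
else
    echo "Cannot read $thp_file on this host"
fi
```

If this reports `always` or `madvise`, running `echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled` should switch the mode until the next reboot. How to make the change persistent depends on the distribution and the AMI in use.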