Ab initio failure

Hi all,
I am trying to run ab initio jobs, but somehow they crash along the way. If I ask for 1 or 2 ab initio models there is no issue; it works. But with 3 models and above, the job starts fine and creates the right number of models, and they look OK until a certain number of iterations, where they start to get worse and the job finally fails with the following error:
[CPU: 1.84 GB Avail: 250.70 GB]
Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 115, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/abinit/run.py", line 288, in cryosparc_master.cryosparc_compute.jobs.abinit.run.run_homo_abinit
  File "/home/bio21em2/Software/Cryosparc/cryosparc_worker/cryosparc_compute/noise_model.py", line 119, in get_noise_estimate
    assert n.all(n.isfinite(ret))
AssertionError

We are running CryoSPARC v4.5.1.
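
(A debugging note: the assertion fires when the computed noise estimate contains NaN or Inf values, which usually traces back to non-finite values somewhere in the input. Below is a minimal, untested Python sketch, using numpy and the mrcfile package, to check whether a particle stack itself contains non-finite pixels; the stack path is a placeholder.)

import numpy as np
import mrcfile  # pip install mrcfile

# Placeholder path; point this at one of your extracted particle stacks.
stack_path = "path/to/particles.mrcs"

with mrcfile.open(stack_path, permissive=True) as f:
    data = f.data
bad = ~np.isfinite(data)
print(f"{bad.sum()} non-finite pixels out of {data.size}")
if bad.any():
    # Indices of the particle images that contain NaN/Inf pixels
    print("affected images:", np.unique(np.nonzero(bad)[0]))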

I’ve run into this before, but only with non-standard parameters or unusual datasets. Is this happening with default params, on otherwise well-behaved datasets?

Yes, standard parameters, but the dataset is average: relatively thick ice, as the particle is 55 nm, the sample contains 1 M urea, and data were collected at 200 kV, so you can imagine the signal-to-noise ratio could be better! The CTF fit is not great, on average 6 Å.
The maps go to 4.8 Å.
One of the things I don't get is why, when selecting different numbers of models, the output is either fine or a crash.

Hey CSers,
I ran into the same problem. Ab initio with 2 classes was fine; however, it ends with the same error when running with 3 classes, which turns out to be essential for my dataset.

I am running v4.5.3, probably on the same cluster as @ehanssen.
Any leads on this please?

@nameless_wonder Did you observe the AssertionError for this dataset only, or also for other datasets?

@wtempel Apologies, this is something I should have mentioned in the first place.
This is the only dataset I am working with at the moment; however, even with this dataset, ab initio jobs with 3 classes went through successfully earlier, within the same workspace.

Thanks @nameless_wonder. Please can you

  1. clone one of the successful 3-class ab initio jobs
  2. run the cloned job in an environment as similar as possible to that of the old job
  3. share with the developers the job reports of the old (successful) and new (presumably failed) job.

I will send you a direct message about a suitable email address.

Dear @wtempel,

Surprisingly, the clone of the old successful job went through successfully. Then I cloned an old unsuccessful job, which failed with the same error.

I wanted to troubleshoot a bit, which is why it took me longer to respond (apologies). So, I cloned a successful job and linked particles from an unsuccessful one, and vice versa. However, now I am getting the following error:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 115, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/abinit/run.py", line 480, in cryosparc_master.cryosparc_compute.jobs.abinit.run.run_homo_abinit
  File "/apps/cryosparc/cryosparc-general/4.5.3/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1002, in output_single_volume
    dset = create_single_volume_ds(map_r, psize, name, rel_path_no_ext, symop=symop, write_volume=write_volume, rel_path_mrc=rel_path_mrc)
  File "/apps/cryosparc/cryosparc-general/4.5.3/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 979, in create_single_volume_ds
    mrc.write_mrc(os.path.join(get_project_dir_abs(), rel_path_mrc), map_r, psize)
  File "/apps/cryosparc/cryosparc-general/4.5.3/cryosparc_worker/cryosparc_compute/blobio/mrc.py", line 252, in write_mrc
    cryosparc_io.write_mrc(
RuntimeError: couldn't write to /cryosparc/co55/forhad/NegeVirus/CS-nege-virus/J254/J254_class_00_00000_volume.mrc
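
(That RuntimeError suggests a filesystem problem rather than a problem with the particle data. Below is a quick standard-library sketch, with the directory copied from the traceback above, to rule out free-space and permission issues; the test-write file name is hypothetical and is removed afterwards.)

import os
import shutil

# Directory taken from the RuntimeError above
job_dir = "/cryosparc/co55/forhad/NegeVirus/CS-nege-virus/J254"

total, used, free = shutil.disk_usage(job_dir)
print(f"free space: {free / 1e9:.1f} GB")
print("writable per os.access:", os.access(job_dir, os.W_OK))

# An actual test write, since quotas can fail even when os.access reports OK
test_path = os.path.join(job_dir, "write_test.tmp")
with open(test_path, "wb") as f:
    f.write(b"\0" * 1024)
os.remove(test_path)
print("test write succeeded")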

Anyway, I have emailed the job reports (job.log files) of both the successful and unsuccessful cloned jobs. Please note, I have been running with variable class similarity (0 to 0.5), and I have always kept "Cache particle images to SSD" set to False. These parameters were applied to both the successful and unsuccessful jobs.

Successful job: successful_job.log - Google Drive

Unsuccessful job: unsuccessful_job.log - Google Drive

Current error: see the traceback quoted above.

@nameless_wonder Thanks for posting this info. Please can you post the outputs of these commands

cryosparcm cli "get_job('P46', 'J242', 'job_type', 'version', 'instance_information', 'status',  'params_spec', 'errors_run', 'input_slot_groups', 'started_at')"
cryosparcm cli "get_job('P46', 'J252', 'job_type', 'version', 'instance_information', 'status',  'params_spec', 'errors_run', 'input_slot_groups', 'started_at')"
cryosparcm joblog P46 J254 | tail -n 40

Dear @wtempel,

Apologies for the delay, as I had to liaise with our cluster managers and also do a bit more troubleshooting.

Here are the outputs:

$ cryosparcm cli "get_job('P46', 'J242', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run', 'input_slot_groups', 'started_at')"
{'_id': '67a550eb0d1cd394b979b233', 'errors_run': [], 'input_slot_groups': [{'connections': [{'group_name': 'particles_selected', 'job_uid': 'J31', 'slots': [{'group_name': 'particles_selected', 'job_uid': 'J31', 'result_name': 'blob', 'result_type': 'particle.blob', 'slot_name': 'blob', 'version': 'F'}, {'group_name': 'particles_selected', 'job_uid': 'J31', 'result_name': 'ctf', 'result_type': 'particle.ctf', 'slot_name': 'ctf', 'version': 'F'}, {'group_name': 'particles_selected', 'job_uid': 'J31', 'result_name': 'alignments2D', 'result_type': 'particle.alignments2D', 'slot_name': None, 'version': 'F'}, {'group_name': 'particles_selected', 'job_uid': 'J31', 'result_name': 'pick_stats', 'result_type': 'particle.pick_stats', 'slot_name': None, 'version': 'F'}, {'group_name': 'particles_selected', 'job_uid': 'J31', 'result_name': 'location', 'result_type': 'particle.location', 'slot_name': None, 'version': 'F'}]}], 'count_max': inf, 'count_min': 1, 'description': 'Particle stacks to use. Multiple stacks will be concatenated.', 'name': 'particles', 'repeat_allowed': False, 'slots': [{'description': '', 'name': 'blob', 'optional': False, 'title': 'Particle data blobs', 'type': 'particle.blob'}, {'description': '', 'name': 'ctf', 'optional': False, 'title': 'Particle ctf parameters', 'type': 'particle.ctf'}, {'description': '', 'name': 'alignments3D', 'optional': True, 'title': 'Computed alignments (optional -- only used to passthrough half set splits.)', 'type': 'particle.alignments3D'}, {'description': '', 'name': 'filament', 'optional': True, 'title': 'Particle filament info', 'type': 'particle.filament'}], 'title': 'Particle stacks', 'type': 'particle'}], 'instance_information': {'CUDA_version': '11.8', 'available_memory': '950.83GB', 'cpu_model': 'Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz', 'driver_version': '12.2', 'gpu_info': [{'id': 0, 'mem': 15655829504, 'name': 'Tesla T4', 'pcie': '0000:12:00'}], 'ofd_hard_limit': 524288, 'ofd_soft_limit': 1024, 'physical_cores': 52, 'platform_architecture': 'x86_64', 'platform_node': 'm3t101', 'platform_release': '5.14.0-284.25.1.el9_2.x86_64', 'platform_version': '#1 SMP PREEMPT_DYNAMIC Wed Aug 2 14:53:30 UTC 2023', 'total_memory': '1006.60GB', 'used_memory': '50.98GB'}, 'job_type': 'homo_abinit', 'params_spec': {'abinit_K': {'value': 3}, 'abinit_class_anneal_beta': {'value': 0.5}, 'compute_use_ssd': {'value': False}}, 'project_uid': 'P46', 'started_at': 'Fri, 07 Feb 2025 00:16:59 GMT', 'status': 'completed', 'uid': 'J242', 'version': 'v4.5.3'}

$ cryosparcm cli "get_job('P46', 'J252', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run', 'input_slot_groups', 'started_at')"
{'_id': '67a57c4b0d1cd394b97e36b2', 'errors_run': [{'message': '', 'warning': False}], 'input_slot_groups': [{'connections': [{'group_name': 'particles_selected', 'job_uid': 'J132', 'slots': [{'group_name': 'particles_selected', 'job_uid': 'J132', 'result_name': 'blob', 'result_type': 'particle.blob', 'slot_name': 'blob', 'version': 'F'}, {'group_name': 'particles_selected', 'job_uid': 'J132', 'result_name': 'ctf', 'result_type': 'particle.ctf', 'slot_name': 'ctf', 'version': 'F'}, {'group_name': 'particles_selected', 'job_uid': 'J132', 'result_name': 'alignments2D', 'result_type': 'particle.alignments2D', 'slot_name': None, 'version': 'F'}, {'group_name': 'particles_selected', 'job_uid': 'J132', 'result_name': 'pick_stats', 'result_type': 'particle.pick_stats', 'slot_name': None, 'version': 'F'}, {'group_name': 'particles_selected', 'job_uid': 'J132', 'result_name': 'location', 'result_type': 'particle.location', 'slot_name': None, 'version': 'F'}]}], 'count_max': inf, 'count_min': 1, 'description': 'Particle stacks to use. Multiple stacks will be concatenated.', 'name': 'particles', 'repeat_allowed': False, 'slots': [{'description': '', 'name': 'blob', 'optional': False, 'title': 'Particle data blobs', 'type': 'particle.blob'}, {'description': '', 'name': 'ctf', 'optional': False, 'title': 'Particle ctf parameters', 'type': 'particle.ctf'}, {'description': '', 'name': 'alignments3D', 'optional': True, 'title': 'Computed alignments (optional -- only used to passthrough half set splits.)', 'type': 'particle.alignments3D'}, {'description': '', 'name': 'filament', 'optional': True, 'title': 'Particle filament info', 'type': 'particle.filament'}], 'title': 'Particle stacks', 'type': 'particle'}], 'instance_information': {'CUDA_version': '11.8', 'available_memory': '953.83GB', 'cpu_model': 'Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz', 'driver_version': '12.2', 'gpu_info': [{'id': 0, 'mem': 15655829504, 'name': 'Tesla T4', 'pcie': '0000:12:00'}], 'ofd_hard_limit': 524288, 'ofd_soft_limit': 1024, 'physical_cores': 52, 'platform_architecture': 'x86_64', 'platform_node': 'm3t101', 'platform_release': '5.14.0-284.25.1.el9_2.x86_64', 'platform_version': '#1 SMP PREEMPT_DYNAMIC Wed Aug 2 14:53:30 UTC 2023', 'total_memory': '1006.60GB', 'used_memory': '48.20GB'}, 'job_type': 'homo_abinit', 'params_spec': {'abinit_K': {'value': 3}, 'abinit_class_anneal_beta': {'value': 0}, 'compute_use_ssd': {'value': False}}, 'project_uid': 'P46', 'started_at': 'Fri, 07 Feb 2025 03:22:02 GMT', 'status': 'failed', 'uid': 'J252', 'version': 'v4.5.3'}
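
(Comparing the two outputs above: the jobs take their particles from different sources, J31 versus J132, and differ in exactly one parameter. A tiny illustrative diff, with the values copied from the params_spec fields above:)

params_J242_completed = {"abinit_K": 3, "abinit_class_anneal_beta": 0.5, "compute_use_ssd": False}
params_J252_failed = {"abinit_K": 3, "abinit_class_anneal_beta": 0, "compute_use_ssd": False}

for key in sorted(set(params_J242_completed) | set(params_J252_failed)):
    a, b = params_J242_completed.get(key), params_J252_failed.get(key)
    if a != b:
        print(f"{key}: completed={a!r}, failed={b!r}")
# prints: abinit_class_anneal_beta: completed=0.5, failed=0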

$ cryosparcm joblog P46 J254 | tail -n 40
========= sending heartbeat at 2025-02-14 11:47:55.229980
========= sending heartbeat at 2025-02-14 11:48:05.243461
========= sending heartbeat at 2025-02-14 11:48:15.256215
========= sending heartbeat at 2025-02-14 11:48:25.262897
========= sending heartbeat at 2025-02-14 11:48:35.277559
========= sending heartbeat at 2025-02-14 11:48:45.292215
========= sending heartbeat at 2025-02-14 11:48:55.305206
========= sending heartbeat at 2025-02-14 11:49:05.318924
========= sending heartbeat at 2025-02-14 11:49:15.332998
========= sending heartbeat at 2025-02-14 11:49:25.347194
========= sending heartbeat at 2025-02-14 11:49:35.360204
========= sending heartbeat at 2025-02-14 11:49:45.373212
========= sending heartbeat at 2025-02-14 11:49:55.386790
========= sending heartbeat at 2025-02-14 11:50:05.402358
========= sending heartbeat at 2025-02-14 11:50:15.407858
========= sending heartbeat at 2025-02-14 11:50:25.421297
========= sending heartbeat at 2025-02-14 11:50:35.434206
========= sending heartbeat at 2025-02-14 11:50:45.448348
========= sending heartbeat at 2025-02-14 11:50:55.462205
========= sending heartbeat at 2025-02-14 11:51:05.475556
========= sending heartbeat at 2025-02-14 11:51:15.489803
========= sending heartbeat at 2025-02-14 11:51:25.503334
========= sending heartbeat at 2025-02-14 11:51:35.516202
========= sending heartbeat at 2025-02-14 11:51:45.529204
========= sending heartbeat at 2025-02-14 11:51:55.543954
========= sending heartbeat at 2025-02-14 11:52:05.559207
/apps/cryosparc/cryosparc-general/4.5.3/cryosparc_worker/cryosparc_compute/util/logsumexp.py:41: RuntimeWarning: divide by zero encountered in log
return n.log(wa * n.exp(a - vmax) + wb * n.exp(b - vmax) ) + vmax
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-02-14 11:52:15.573200
========= sending heartbeat at 2025-02-14 11:52:25.586943
2025-02-14 11:52:30,358 del INFO | Deleting plot real-slice-000
2025-02-14 11:52:30,386 del INFO | Deleting plot viewing_dist-000
2025-02-14 11:52:30,400 del INFO | Deleting plot real-slice-001
2025-02-14 11:52:30,426 del INFO | Deleting plot viewing_dist-001
2025-02-14 11:52:30,439 del INFO | Deleting plot real-slice-002
2025-02-14 11:52:30,464 del INFO | Deleting plot viewing_dist-002
2025-02-14 11:52:30,477 del INFO | Deleting plot noise_model
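
(The divide-by-zero RuntimeWarning in that tail comes from the log-sum-exp line quoted above. Below is a minimal numpy illustration, not CryoSPARC's actual code path, of how that expression can emit the warning and produce the -inf that could later trip an isfinite assertion: if the dominant term gets zero weight and the other term underflows, the argument of the log is exactly zero.)

import numpy as np

a = np.array([0.0])      # larger argument: exp(a - vmax) == 1
b = np.array([-2000.0])  # much smaller: exp(b - vmax) underflows to 0.0
wa, wb = 0.0, 1.0        # zero weight on the dominant term
vmax = np.maximum(a, b)
# Same formula as the warning line above; the sum inside log() is exactly 0.0,
# so numpy warns "divide by zero encountered in log" and returns -inf.
ret = np.log(wa * np.exp(a - vmax) + wb * np.exp(b - vmax)) + vmax
print(ret)  # [-inf]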


Besides, I was looking into exactly where the issue first occurred. It turns out that at some point I started having this error with ab initio when I was trying to get rid of junk particles through 2D classification. So, as mentioned earlier, to replicate the scenario, I cloned an "unsuccessful" job and linked "successful" particles, and the ab initio succeeded. However, when I ran those "successful" particles through 2 more rounds of 2D classification and did ab initio again, it failed. Happy to post any details of these jobs as required. Please advise.
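
(One hedged way to narrow this down would be to scan the particle metadata after each 2D classification round for non-finite values, e.g. with cryosparc-tools; the connection details below are placeholders, and J132 is the Select 2D job referenced in the get_job output above.)

import numpy as np
from cryosparc.tools import CryoSPARC  # pip install cryosparc-tools

# Placeholder credentials; adjust for your instance.
cs = CryoSPARC(license="xxxx", host="localhost", base_port=39000,
               email="user@example.com", password="...")
job = cs.find_job("P46", "J132")
particles = job.load_output("particles_selected")

# Flag any floating-point metadata field containing NaN/Inf
for field in particles.fields():
    col = particles[field]
    if np.issubdtype(col.dtype, np.floating) and not np.isfinite(col).all():
        print(f"non-finite values in field {field}")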

To add a bit more to it: I was monitoring an ab initio job, and from iteration 600 the maps started looking like this:

@nameless_wonder Did this error occur in the same project P46, or another project? I am asking because the RuntimeError seems to be inconsistent with the output from cryosparcm joblog P46 J254 above.

Has the job been re-run after the RuntimeError was observed?

Hi @wtempel,

Yeah, the job was re-run later, and yes, it was in the same project P46. Apologies that I didn't clarify in the first place.