Homogeneous Refinement Can't Compute Tight FSC

Hi,

I'm currently running cryoSPARC v2.15 (CUDA 10.2) on an AWS instance running Ubuntu, with a single GPU (NVIDIA Tesla K80), 4 CPU cores, and 61 GB of RAM.

My dataset is 29,000 particle images of a VLP with a box size of 640 px. Judging from past cryoSPARC troubleshooting threads, my current machine should be sufficient to process this.

I have successfully processed the dataset with Legacy Homogeneous Refinement after downsampling to 320 px. However, any attempt to process the dataset at its full size fails at the same point, regardless of the job parameters. I have tried several times to reduce the computational load by lowering the minibatch size, reducing the snrfactor (40 -> 10), and raising the batchsize epsilon (0.001 -> 0.1). All jobs terminate abnormally with the same error:

[CPU: 30.59 GB] Computing FSCs…
[CPU: 83.7 MB] ====== Job process terminated abnormally.

Examining the joblogs of the failed jobs reveals one commonality: every job fails at the same point, where it attempts to compute the tight-mask FSC. A representative joblog is included at the end of this post. Reintroducing downsampling to 320 px fixes the issue; however, nominal downsampling (i.e. from 640 px to 600 px) returns the same error. Any help would be greatly appreciated.
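(For what it's worth, my understanding is that per-volume memory scales with the cube of the box size, so downsampling to 320 px cuts each volume to 1/8 of its full-size footprint, while 600 px only trims it by about 18%, since 600³/640³ ≈ 0.82.)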

Best,
jr10

Running job J141 of type homo_refine
Running job on hostname %s ip-172-31-44-65.us-west-2.compute.internal
Allocated Resources : {u'lane': u'default', u'target': {u'monitor_port': None, u'lane': u'default', u'name': u'ip-172-31-44-65.us-west-2.compute.internal', u'title': u'Worker node ip-172-31-44-65.us-west-2.compute.internal', u'resource_slots': {u'GPU': [0], u'RAM': [0, 1, 2, 3, 4, 5, 6, 7], u'CPU': [0, 1, 2, 3]}, u'hostname': u'ip-172-31-44-65.us-west-2.compute.internal', u'worker_bin_path': u'/home/cryosparc_user/software/cryosparc/cryosparc2_worker/bin/cryosparcw', u'cache_path': u'/data/cryosparc-ssd', u'cache_quota_mb': None, u'resource_fixed': {u'SSD': True}, u'gpus': [{u'mem': 11996954624, u'id': 0, u'name': u'Tesla K80'}], u'cache_reserve_mb': 10000, u'type': u'node', u'ssh_str': u'cryosparc_user@ip-172-31-44-65.us-west-2.compute.internal', u'desc': None}, u'license': True, u'hostname': u'ip-172-31-44-65.us-west-2.compute.internal', u'slots': {u'GPU': [0], u'RAM': [0, 1, 2], u'CPU': [0, 1, 2, 3]}, u'fixed': {u'SSD': True}, u'lane_type': u'default', u'licenses_acquired': 1}
========= sending heartbeat (repeated)
cryosparc2_compute/plotutil.py:244: RuntimeWarning: divide by zero encountered in log
logabs = n.log(n.abs(fM))
========= sending heartbeat (repeated)
FSC No-Mask… ========= sending heartbeat (repeated)
0.143 at 5.992 radwn. 0.5 at 3.291 radwn. Took 26.914s.
FSC Spherical Mask… ========= sending heartbeat (repeated)
0.143 at 6.349 radwn. 0.5 at 5.873 radwn. Took 39.884s.
FSC Loose Mask… ========= sending heartbeat (repeated)
0.143 at 6.500 radwn. 0.5 at 3.565 radwn. Took 170.589s.
FSC Tight Mask… ========= sending heartbeat (repeated)
========= main process now complete.
========= monitor process now complete.

Hi @jr10,
This happens because the system is running out of CPU RAM when computing the FSCs. Several volumes need to be held in memory simultaneously at this stage (half-maps, masks, masked half-maps, Fourier transforms of each of those, noise-substituted volumes, etc.). The job will run until system RAM is exhausted, at which point the OS kills the process directly, producing the "job process terminated abnormally" message. Unfortunately, there is currently no way to reduce this memory requirement, so you would have to pick an instance type with more CPU RAM.
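For a sense of scale, here's a rough back-of-the-envelope sketch in Python/NumPy. The working-set composition (6 real-space volumes plus their 6 Fourier transforms) is purely an illustrative assumption, not cryoSPARC's actual internal bookkeeping:

# Illustrative estimate of host RAM used at the FSC stage.
# The 6 real + 6 complex volume counts are assumptions;
# cryoSPARC's actual working set is internal and larger.
import numpy as np

def volume_gib(box, dtype):
    """Memory of one box^3 volume, in GiB."""
    return box**3 * np.dtype(dtype).itemsize / 1024**3

for box in (320, 600, 640):
    real = volume_gib(box, np.float32)    # real-space volume
    cplx = volume_gib(box, np.complex64)  # its Fourier transform
    est = 6 * real + 6 * cplx             # assumed working set
    print(f"box {box} px: {real:.2f} GiB per volume, ~{est:.1f} GiB working set")

At 640 px that's about 1 GiB per real-space volume and roughly 18 GiB even for this conservative assumed set, on top of the ~30 GB already resident (the "[CPU: 30.59 GB]" line in your log); noise-substitution copies and FFT temporaries grow it further, so exhausting 61 GB is plausible. At 320 px every term shrinks by 8x, which is consistent with your downsampled jobs succeeding.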
This would be happening because the system is running out of CPU RAM when computing FSCs. Several volumes need to be stored at the same time at this stage (halfmaps, masks, masked halfmaps, Fourier transforms of those, noise substituted volumes etc). The system will try to run the job but upon running out of system RAM, the job will be terminated directly by the system causing the “job process terminated abnormally”. Unfortunately there isn’t a way to reduce this memory requirement currently so you would have to pick an instance type with more CPU RAM.