Local Refinement stalling at first iteration

Hello,

We have been trying to run a local refinement job (C2 symmetry, 300-pixel box, 200k particles) that keeps stalling at iteration 000. The job generates maps A and B and then seems to stop, but it never fails (it will keep running for days). Has anyone encountered this and, if so, what was the solution? Thanks for the help!

Best,
Kyle

@KyleBarrie Please can you post the outputs of these commands:

cryosparcm cli "get_job('P99', 'J199', 'version', 'job_type', 'params_spec', 'status', 'instance_information', 'input_slot_groups')"
cryosparcm eventlog P99 J199 | tail -n 40
cryosparcm joblog P99 J199 | tail -n 20

where you would replace P99 and J199 with the stalled job’s project and job IDs, respectively.
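
For anyone collecting the same diagnostics for several jobs, the substitution described above can be sketched in Python. This is only a convenience sketch: the function name `diagnostic_commands` is made up for illustration, and the ID check assumes project and job IDs always look like `P<number>` and `J<number>`, as they do in this thread. It only builds the command strings and does not run them.

```python
import re

# Field names copied from the get_job(...) call quoted above.
JOB_FIELDS = [
    "version", "job_type", "params_spec", "status",
    "instance_information", "input_slot_groups",
]


def diagnostic_commands(project_uid, job_uid):
    """Return the three diagnostic command lines for one job (hypothetical helper)."""
    # Assumption: IDs have the form P<digits> / J<digits>, as in this thread.
    if not re.fullmatch(r"P\d+", project_uid):
        raise ValueError(f"unexpected project ID: {project_uid!r}")
    if not re.fullmatch(r"J\d+", job_uid):
        raise ValueError(f"unexpected job ID: {job_uid!r}")

    # repr() yields the single-quoted form used inside the cli call.
    args = ", ".join(repr(s) for s in [project_uid, job_uid, *JOB_FIELDS])
    return [
        f'cryosparcm cli "get_job({args})"',
        f"cryosparcm eventlog {project_uid} {job_uid} | tail -n 40",
        f"cryosparcm joblog {project_uid} {job_uid} | tail -n 20",
    ]


# Print the commands for the example IDs used above.
for command in diagnostic_commands("P99", "J199"):
    print(command)
```

The printed lines can be pasted into a shell on the CryoSPARC master computer.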

Hello,

Thanks for the quick reply. Please find the outputs below:

{'_id': '668fe3a487879a989a55ad76', 'input_slot_groups': [{'connections': [{'group_name': 'particles', 'job_uid': 'J846', 'slots': [{'group_name': 'particles', 'job_uid': 'J846', 'result_name': 'blob', 'result_type': 'particle.blob', 'slot_name': 'blob', 'version': 'F'}, {'group_name': 'particles', 'job_uid': 'J846', 'result_name': 'ctf', 'result_type': 'particle.ctf', 'slot_name': 'ctf', 'version': 'F'}, {'group_name': 'particles', 'job_uid': 'J846', 'result_name': 'alignments3D', 'result_type': 'particle.alignments3D', 'slot_name': 'alignments3D', 'version': 'F'}, {'group_name': 'particles', 'job_uid': 'J846', 'result_name': 'alignments2D', 'result_type': 'particle.alignments2D', 'slot_name': None, 'version': 'F'}, {'group_name': 'particles', 'job_uid': 'J846', 'result_name': 'pick_stats', 'result_type': 'particle.pick_stats', 'slot_name': None, 'version': 'F'}, {'group_name': 'particles', 'job_uid': 'J846', 'result_name': 'location', 'result_type': 'particle.location', 'slot_name': None, 'version': 'F'}, {'group_name': 'particles', 'job_uid': 'J846', 'result_name': 'ml_properties', 'result_type': 'particle.ml_properties', 'slot_name': None, 'version': 'F'}]}], 'count_max': inf, 'count_min': 1, 'description': 'Particle stacks to use. 
Multiple stacks will be concatenated.', 'name': 'particles', 'repeat_allowed': False, 'slots': [{'description': '', 'name': 'blob', 'optional': False, 'title': 'Particle data blobs', 'type': 'particle.blob'}, {'description': '', 'name': 'ctf', 'optional': False, 'title': 'Particle ctf parameters', 'type': 'particle.ctf'}, {'description': '', 'name': 'alignments3D', 'optional': False, 'title': 'Particle 3D alignments', 'type': 'particle.alignments3D'}], 'title': 'Particle stacks', 'type': 'particle'}, {'connections': [{'group_name': 'volume', 'job_uid': 'J846', 'slots': [{'group_name': 'volume', 'job_uid': 'J846', 'result_name': 'map', 'result_type': 'volume.blob', 'slot_name': 'map', 'version': 'F'}, {'group_name': 'volume', 'job_uid': 'J846', 'result_name': 'map_sharp', 'result_type': 'volume.blob', 'slot_name': None, 'version': 'F'}, {'group_name': 'volume', 'job_uid': 'J846', 'result_name': 'map_half_A', 'result_type': 'volume.blob', 'slot_name': None, 'version': 'F'}, {'group_name': 'volume', 'job_uid': 'J846', 'result_name': 'map_half_B', 'result_type': 'volume.blob', 'slot_name': None, 'version': 'F'}, {'group_name': 'volume', 'job_uid': 'J846', 'result_name': 'mask_refine', 'result_type': 'volume.blob', 'slot_name': None, 'version': 'F'}, {'group_name': 'volume', 'job_uid': 'J846', 'result_name': 'mask_fsc', 'result_type': 'volume.blob', 'slot_name': None, 'version': 'F'}, {'group_name': 'volume', 'job_uid': 'J846', 'result_name': 'mask_fsc_auto', 'result_type': 'volume.blob', 'slot_name': None, 'version': 'F'}, {'group_name': 'volume', 'job_uid': 'J846', 'result_name': 'precision', 'result_type': 'volume.blob', 'slot_name': None, 'version': 'F'}]}], 'count_max': 1, 'count_min': 1, 'description': '', 'name': 'volume', 'repeat_allowed': False, 'slots': [{'description': '', 'name': 'map', 'optional': False, 'title': 'Initial volume raw data', 'type': 'volume.blob'}], 'title': 'Initial volume', 'type': 'volume'}, {'connections': [{'group_name': 'mask', 
'job_uid': 'J862', 'slots': [{'group_name': 'mask', 'job_uid': 'J862', 'result_name': 'mask', 'result_type': 'volume.blob', 'slot_name': 'mask', 'version': 'F'}]}], 'count_max': 1, 'count_min': 1, 'description': '', 'name': 'mask', 'repeat_allowed': False, 'slots': [{'description': '', 'name': 'mask', 'optional': False, 'title': 'Static mask', 'type': 'volume.blob'}], 'title': 'Static mask', 'type': 'mask'}], 'instance_information': {'CUDA_version': '11.8', 'available_memory': '179.20GB', 'cpu_model': 'Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz', 'driver_version': '12.0', 'gpu_info': [{'id': 0, 'mem': 25438126080, 'name': 'NVIDIA GeForce RTX 3090'}], 'ofd_hard_limit': 4096, 'ofd_soft_limit': 1024, 'physical_cores': 20, 'platform_architecture': 'x86_64', 'platform_node': 'spgpu', 'platform_release': '3.10.0-1160.49.1.el7.x86_64', 'platform_version': '#1 SMP Tue Nov 30 15:51:32 UTC 2021', 'total_memory': '188.39GB', 'used_memory': '8.44GB'}, 'job_type': 'new_local_refine', 'params_spec': {'refine_symmetry': {'value': 'C2'}}, 'project_uid': 'P24', 'status': 'running', 'uid': 'J869', 'version': 'v4.4.0'}

[CPU RAM used: 2285 MB]   Initializing noise model... (2/2)
Noise Model Initialization (2/2)
[CPU RAM used: 3443 MB]   Processed 199107.000 images in 230.573s.
[CPU RAM used: 3663 MB]   Computing FSCs...
[CPU RAM used: 3663 MB]   Using full box size 300, downsampled box size 160, with low memory mode disabled.
[CPU RAM used: 3663 MB]   Computing FFTs on GPU.
[CPU RAM used: 3813 MB]     Done in 1.616s
[CPU RAM used: 3813 MB]   Using Filter Radius 57.391 (4.495A) | Previous: 21.500 (12.000A)
[CPU RAM used: 4697 MB]   Non-uniform regularization with compute option: GPU
[CPU RAM used: 4593 MB]   Running local cross validation for A ...
[CPU RAM used: 7004 MB]   Local cross validation A done in 16.569s
FSC Filtered Side A
CV Filtered Side A
[CPU RAM used: 7006 MB]   Running local cross validation for B ...
[CPU RAM used: 7338 MB]   Local cross validation B done in 15.027s
FSC Filtered Side B
CV Filtered Side B
[CPU RAM used: 7661 MB]   Estimated Bfactor: -184.7
[CPU RAM used: 7661 MB]   Plotting..
Real Space Slices Iteration 000
Fourier Space Slices Iteration 000
Real Space Mask Slices Iteration 000
FSC Iteration 000
Guinier Plot Iteration 000
Noise Model Iteration 000
Viewing Direction Distribution Iteration 000
Posterior Precision Directional Distribution Iteration 000
Magnitudes of alignment changes Iteration 000
Per particle scale factors 000
[CPU RAM used: 8010 MB]     Done in 16.162s.
[CPU RAM used: 8010 MB]   Outputting files..
[CPU RAM used: 8423 MB]     Done in 4.748s.
[CPU RAM used: 8423 MB] Done iteration 0 in 302.652s. Total time so far 681.325s
[CPU RAM used: 8424 MB] ----------------------------- Start Iteration 1
[CPU RAM used: 8424 MB]   Using Max Alignment Radius 57.391 (4.495A)
[CPU RAM used: 8424 MB]   Using Full Dataset (split 99554 in A, 99553 in B)
[CPU RAM used: 8444 MB]   Current alpha values  (  0.53 |  0.89 |  1.00 |  1.11 |  1.94 )
[CPU RAM used: 8444 MB] -- THR 1 BATCH 500 NUM 13500 TOTAL 43.754398 ELAPSED 48.925306 --
Alignment map A
Alignment map B

========= sending heartbeat at 2024-07-12 13:57:58.305225
========= sending heartbeat at 2024-07-12 13:58:08.319061
========= sending heartbeat at 2024-07-12 13:58:18.337933
========= sending heartbeat at 2024-07-12 13:58:28.357190
========= sending heartbeat at 2024-07-12 13:58:38.368291
========= sending heartbeat at 2024-07-12 13:58:48.384898
========= sending heartbeat at 2024-07-12 13:58:58.405078
========= sending heartbeat at 2024-07-12 13:59:08.421672
========= sending heartbeat at 2024-07-12 13:59:18.440368
========= sending heartbeat at 2024-07-12 13:59:28.454406
========= sending heartbeat at 2024-07-12 13:59:38.466724
========= sending heartbeat at 2024-07-12 13:59:48.485521
========= sending heartbeat at 2024-07-12 13:59:58.500562
========= sending heartbeat at 2024-07-12 14:00:08.519740
========= sending heartbeat at 2024-07-12 14:00:18.537002
========= sending heartbeat at 2024-07-12 14:00:28.557599
========= sending heartbeat at 2024-07-12 14:00:38.570735
========= sending heartbeat at 2024-07-12 14:00:48.590649
========= sending heartbeat at 2024-07-12 14:00:58.604187
========= sending heartbeat at 2024-07-12 14:01:08.624615
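
The job log above shows the pattern typical of this kind of stall: the process keeps sending heartbeats every ten seconds, but nothing else is written. A minimal Python sketch for measuring that pattern in a saved job log follows. The function name `trailing_heartbeat_run` is hypothetical, and the heartbeat line format is assumed to be exactly the one shown in the output above.

```python
import re
from datetime import datetime

# Assumption: heartbeat lines look like the ones in the job log above.
HEARTBEAT = re.compile(
    r"^=+ sending heartbeat at (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)$"
)


def trailing_heartbeat_run(lines):
    """Return (count, seconds) for the unbroken run of heartbeats ending the log."""
    stamps = []
    for line in reversed(lines):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = HEARTBEAT.match(line)
        if match is None:
            break  # first non-heartbeat line ends the run
        stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    if not stamps:
        return 0, 0.0
    # stamps[0] is the newest heartbeat, stamps[-1] the oldest in the run.
    return len(stamps), (stamps[0] - stamps[-1]).total_seconds()


example = [
    "some other job output",
    "========= sending heartbeat at 2024-07-12 13:57:58.305225",
    "========= sending heartbeat at 2024-07-12 13:58:08.319061",
]
count, seconds = trailing_heartbeat_run(example)
print(count, round(seconds, 3))
```

Note that `tail -n 20` returns only the last 20 lines, so a run of 20 heartbeats shows only that the final three minutes or so contained no other output. Feeding the full job log to the function gives a better estimate of how long the job has been silent.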

Thanks @KyleBarrie for posting the information.
Do you know if the job uses particle caching and, if it does, how the particle cache is implemented:

  • an individual, local SSD device for each worker? If so, is it an NVMe SSD?
  • or a shared filesystem? If so, which type?

Hi wtempel,

This job does use particle caching. I believe the particles are cached on an individual SSD and not on a shared file system.

Best,
Kyle

Please can you post additional information:

  • outputs of these commands on the CryoSPARC master computer
    hostname -f
    free -h
    cryosparcm cli "get_scheduler_targets()"
    cryosparcm status | grep -e HOSTNAME -e PORT
    cat /sys/kernel/mm/transparent_hugepage/enabled
    nproc
    
  • outputs of these commands on the worker spgpu (if that computer is separate from the CryoSPARC master computer)
    hostname -f
    free -h
    cat /sys/kernel/mm/transparent_hugepage/enabled
    nproc
    

Please also let us know

  • if it is possible that either the master or worker computer may have run out of physical RAM and started swapping when job J869 stalled
  • if other jobs and/or the CryoSPARC web app also appeared to stall along with job J869
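
As a rough aid for the swapping question, the following Python sketch reads swap usage from the text of `/proc/meminfo` on a Linux host. The function names are hypothetical, and the sample values are invented for illustration. It reports only the state at the moment it is run, so it cannot show what memory looked like when the job stalled unless it is run while the stall is in progress.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_kb(meminfo):
    """Swap currently in use, computed as SwapTotal minus SwapFree."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


# Invented sample values, for illustration only.
sample = """MemTotal:       197540000 kB
MemAvailable:   180000000 kB
SwapTotal:        8388604 kB
SwapFree:         8000000 kB
"""
info = parse_meminfo(sample)
print("swap used (kB):", swap_used_kb(info))
```

On a real host, the sample string would be replaced by the contents of `/proc/meminfo`, for example `open("/proc/meminfo").read()`. A nonzero and growing value while the job is stalled would be consistent with swapping.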

Hi wtempel,

Sorry for the late reply.

Upon looking into it more thoroughly, it looks like one of our GPUs became faulty and that was causing the issue. We are sending it for repairs now and will update you on whether this has indeed fixed the issue.

Best,
Kyle