Particle extraction hangs after last micrograph

Hi,

We have been having a problem where the particle extraction step hangs intermittently (sometimes it completes without errors) after CryoSPARC reaches the final micrograph. This happens on both versions 2.15.1 and 2.16.1 (installed on Ubuntu 16.04 on the master node and CentOS 7 on the worker nodes). We recently configured our system with 3 worker nodes connecting to a master node in a non-cluster setup. All other jobs work in this configuration; only particle extraction shows this error.

I have checked the extract directory in the job directory, and I can see particle stacks present for the micrographs.

Here is the output from the extraction step:

[CPU: 3.56 GB]   (5776 of 5784) Finished processing micrograph 5778.
[CPU: 3.56 GB]   (5777 of 5784) Finished processing micrograph 5777.
[CPU: 3.56 GB]   (5778 of 5784) Finished processing micrograph 5781.
[CPU: 3.56 GB]   (5779 of 5784) Finished processing micrograph 5779.
[CPU: 3.56 GB]   (5780 of 5784) Finished processing micrograph 5780.
[CPU: 3.56 GB]   (5781 of 5784) Finished processing micrograph 5783.
[CPU: 3.56 GB]   (5782 of 5784) Finished processing micrograph 5782.

Here is the CryoSPARC job log:
================= CRYOSPARCW =======  2020-06-24 18:04:37.611009  =========
Project P36 Job J193
Master eagle Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 30144
========= monitor process now waiting for main process
MAIN PID 30144
extract.run cryosparc2_compute.jobs.jobregister
========= sending heartbeat
========= sending heartbeat
***************************************************************
/app/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
/app/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
/app/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
/app/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
========= sending heartbeat
cryosparc2_compute/micrographs.py:405: RuntimeWarning: divide by zero encountered in divide
  return trim_mic(arr_zp_lp, out_shape) / trim_mic(ones_zp_lp, out_shape)
cryosparc2_compute/micrographs.py:405: RuntimeWarning: invalid value encountered in divide
  return trim_mic(arr_zp_lp, out_shape) / trim_mic(ones_zp_lp, out_shape)
cryosparc2_compute/micrographs.py:405: RuntimeWarning: divide by zero encountered in divide
  return trim_mic(arr_zp_lp, out_shape) / trim_mic(ones_zp_lp, out_shape)
cryosparc2_compute/micrographs.py:405: RuntimeWarning: invalid value encountered in divide
  return trim_mic(arr_zp_lp, out_shape) / trim_mic(ones_zp_lp, out_shape)
cryosparc2_compute/micrographs.py:405: RuntimeWarning: divide by zero encountered in divide
  return trim_mic(arr_zp_lp, out_shape) / trim_mic(ones_zp_lp, out_shape)
cryosparc2_compute/micrographs.py:405: RuntimeWarning: invalid value encountered in divide
  return trim_mic(arr_zp_lp, out_shape) / trim_mic(ones_zp_lp, out_shape)
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
/app/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/matplotlib/pyplot.py:522: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat

Let me know if you have any suggestions on ways to fix this problem.

Thanks

Hi @couch,

Could you run this job in CPU mode (set the “Number of GPUs to parallelize (0 for CPU-only)” parameter to 0) or with only 1 GPU, and let us know whether this type of error still happens? We’re investigating this issue internally.

Hi Stephen,

Thanks for the response. I will try running the extraction job with 1 GPU and a separate job with 0 GPUs (CPU-only), and let you know if we still get this error.

We ran a few different jobs: multi-GPU, single GPU, and CPU. Both the single-GPU and CPU extraction jobs completed successfully, while the multi-GPU extraction occasionally stalled after extracting from the last micrograph. For now, 1-GPU extraction has worked in our tests.

Hi @stephan we are also seeing this issue in the 2.16 beta - did you figure out the origin of this error?

Cheers
Oli

I had a similar problem where the “extract particles from micrographs” job hangs forever. The output says there are no particles in the last micrograph, so it looks like the job got stuck there. I’m not sure whether this is the real cause, but the job did complete when I re-ran it with the last micrograph deleted.

Hi,
I also encountered the same problem.
It seems to have come along with one of the recent updates, or even a patch.
I’m testing the same job with single GPU now and I will get back to here if that resolves the hanging issue.

And yes, a single GPU solved the hanging problem with current version v2.15.0+200728.

I encountered the same issue in v4.6.0. The extraction job hangs after processing the final micrograph (see attached image). I ran this job using 4 GPUs.

I am planning to run this with 1 GPU after coming across this troubleshooting tip. Let’s see how it goes.

But did anyone figure out the origin of this error?

@abhipsa Please can you post the outputs of these commands

csprojectid=P99 # replace with actual project ID
csjobid=J199 # replace with ID of the failed job
cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'instance_information', 'status',  'params_spec', 'errors_run')"
cryosparcm joblog $csprojectid $csjobid | tail -n 40
cryosparcm eventlog $csprojectid $csjobid | sed '/Master running/q'

Thanks @abhipsa. What is the output of these commands on the chuspa computer:

cat /sys/kernel/mm/transparent_hugepage/enabled
sudo journalctl | grep -i oom

Thanks for the update @abhipsa. Please update this topic if you observe the job hanging again, either with transparent_hugepage.enabled set to [never], or, under CryoSPARC version 4.6.2 or newer, with it set to [never] or [madvise].
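For reference, the active THP mode is the bracketed entry in that sysfs file. Below is a minimal, generic Linux sketch (not CryoSPARC-specific) for reading it; the line that actually changes the setting is commented out because it requires root:

```shell
# Sample line as it appears in /sys/kernel/mm/transparent_hugepage/enabled.
# On a real node, read the file instead:
#   thp_line=$(cat /sys/kernel/mm/transparent_hugepage/enabled)
thp_line='[always] madvise never'

# The active mode is the word in square brackets.
mode=$(printf '%s' "$thp_line" | grep -o '\[[a-z]*\]' | tr -d '[]')
echo "active THP mode: $mode"    # prints "active THP mode: always" for the sample line

# To switch to madvise until the next reboot (requires root):
#   echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```

Note that a change made this way does not persist across reboots; a kernel boot parameter (transparent_hugepage=madvise) or an init script is needed for a persistent setting.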

Hello,

Wanted to update that we are running into this issue even with transparent_hugepage.enabled set to [madvise] in 4.6.2. I currently encounter the hanging issue with any GPUs enabled, but the job completes normally with CPUs only (see attached image; the first two jobs had to be killed while hanging on the last step. The warning on the first job indicated that hugepages were enabled, whereas the second two jobs had the madvise setting).

Here is what our cluster admin sent along for your requests (run prior to trying changing the transparent_hugepage setting):

$ cryosparcm cli "get_job('P443','J39','job_type', 'version',
'instance_information', 'status',  'params_spec', 'errors_run')"
{'_id': '67d05db00408a872fa9b7da8', 'errors_run': [],
'instance_information': {'CUDA_version': '11.8', 'available_memory':
'453.33GB', 'cpu_model': 'Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz',
'driver_version': '12.4', 'gpu_info': [{'id': 0, 'mem': 11707547648,
'name': 'NVIDIA GeForce GTX 1080 Ti', 'pcie': '0000:04:00'}, {'id': 1,
'mem': 11707547648, 'name': 'NVIDIA GeForce GTX 1080 Ti', 'pcie':
'0000:05:00'}, {'id': 2, 'mem': 12774539264, 'name': 'NVIDIA TITAN X
(Pascal)', 'pcie': '0000:08:00'}, {'id': 3, 'mem': 12774539264, 'name':
'NVIDIA TITAN X (Pascal)', 'pcie': '0000:09:00'}, {'id': 4, 'mem':
11707547648, 'name': 'NVIDIA GeForce GTX 1080 Ti', 'pcie': '0000:84:00'},
{'id': 5, 'mem': 11707547648, 'name': 'NVIDIA GeForce GTX 1080 Ti', 'pcie':
'0000:85:00'}, {'id': 6, 'mem': 12774539264, 'name': 'NVIDIA TITAN X
(Pascal)', 'pcie': '0000:88:00'}, {'id': 7, 'mem': 12774539264, 'name':
'NVIDIA TITAN X (Pascal)', 'pcie': '0000:89:00'}], 'ofd_hard_limit':
524288, 'ofd_soft_limit': 1024, 'physical_cores': 32,
'platform_architecture': 'x86_64', 'platform_node': 'ebony.in.hwlab',
'platform_release': '5.14.0-503.23.2.el9_5.x86_64', 'platform_version': '#1
SMP PREEMPT_DYNAMIC Wed Feb 12 05:52:18 EST 2025', 'total_memory':
'503.27GB', 'used_memory': '46.43GB'}, 'job_type':
'extract_micrographs_multi', 'params_spec': {'bin_size_pix': {'value':
128}, 'box_size_pix': {'value': 360}, 'compute_num_gpus': {'value': 2},
'num_extract': {'value': 10}, 'output_f16': {'value': True}},
'project_uid': 'P443', 'status': 'killed', 'uid': 'J39', 'version':
'v4.6.2'}

$ cryosparcm joblog  P443 J39| tail -n 40
========= sending heartbeat at 2025-03-11 12:08:53.661622
========= sending heartbeat at 2025-03-11 12:09:03.677307
========= sending heartbeat at 2025-03-11 12:09:13.693385
========= sending heartbeat at 2025-03-11 12:09:23.709084
========= sending heartbeat at 2025-03-11 12:09:33.725044
========= sending heartbeat at 2025-03-11 12:09:43.740866
========= sending heartbeat at 2025-03-11 12:09:53.755558
========= sending heartbeat at 2025-03-11 12:10:03.771115
========= sending heartbeat at 2025-03-11 12:10:13.787599
========= sending heartbeat at 2025-03-11 12:10:23.803472
========= sending heartbeat at 2025-03-11 12:10:33.819575
========= sending heartbeat at 2025-03-11 12:10:43.835356
========= sending heartbeat at 2025-03-11 12:10:53.851129
========= sending heartbeat at 2025-03-11 12:11:03.867426
========= sending heartbeat at 2025-03-11 12:11:13.882562
========= sending heartbeat at 2025-03-11 12:11:23.898192
========= sending heartbeat at 2025-03-11 12:11:33.914218
========= sending heartbeat at 2025-03-11 12:11:43.930046
========= sending heartbeat at 2025-03-11 12:11:53.945830
========= sending heartbeat at 2025-03-11 12:12:03.961179
========= sending heartbeat at 2025-03-11 12:12:13.976170
========= sending heartbeat at 2025-03-11 12:12:23.982638
========= sending heartbeat at 2025-03-11 12:12:33.998675
========= sending heartbeat at 2025-03-11 12:12:44.014479
========= sending heartbeat at 2025-03-11 12:12:54.023102
========= sending heartbeat at 2025-03-11 12:13:04.031891
========= sending heartbeat at 2025-03-11 12:13:14.047913
========= sending heartbeat at 2025-03-11 12:13:24.063990
========= sending heartbeat at 2025-03-11 12:13:34.070092
========= sending heartbeat at 2025-03-11 12:13:44.086475
========= sending heartbeat at 2025-03-11 12:13:54.102462
========= sending heartbeat at 2025-03-11 12:14:04.118217
========= sending heartbeat at 2025-03-11 12:14:14.134385
========= sending heartbeat at 2025-03-11 12:14:24.150354
========= sending heartbeat at 2025-03-11 12:14:34.167035
========= sending heartbeat at 2025-03-11 12:14:44.182605
========= sending heartbeat at 2025-03-11 12:14:54.193152
========= sending heartbeat at 2025-03-11 12:15:04.208984
========= sending heartbeat at 2025-03-11 12:15:14.219881
Terminated

$ cryosparcm eventlog P443 J39 | sed '/Master running/q'
[Tue, 11 Mar 2025 15:58:51 GMT]  License is valid.
[Tue, 11 Mar 2025 15:58:51 GMT]  Launching job on lane default target
ebony.in.hwlab ...
[Tue, 11 Mar 2025 15:58:51 GMT]  Running job on master node hostname
ebony.in.hwlab
[Tue, 11 Mar 2025 15:58:52 GMT] [CPU RAM used: 90 MB] Job J39 Started
[Tue, 11 Mar 2025 15:58:52 GMT] [CPU RAM used: 90 MB] Master running
v4.6.2, worker running v4.6.2

Nothing unusual here.


We had huge pages enabled -
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

Didn't have any OOM events though -

# journalctl | grep -i oom |grep -v ansible

Let me know if you need any more info. Thank you!

@tlevitz Thanks for reporting. Please can you email us the job reports (zip archives) for J38…J40?

@wtempel I have some more information about our specific situation that may be helpful and change what you need. I was accidentally queuing on our “default” lane (which is not, in fact, the default lane we use). Our default lane has a 1080Ti GPU, whereas the lane we usually use (“dfcicluster”) has a mix of 1080Ti and A4500 GPUs. If I send the job to an A4500 GPU, it completes normally. If I send it to a 1080Ti GPU (either through slurm or directly, which is the difference between the “default” lane and “cluster-1080Ti”), the job hangs. We do not have this issue when we run other GPU-enabled jobs (e.g. a 2D classification job) on the 1080Ti lane. Thoughts? I can send the job reports for the old jobs and/or these new ones if that is still helpful; just let me know.

Interesting. @tlevitz May I ask

  1. Were J52 and J53 sent to the same cluster lane, and ended up running on different types of GPUs?
  2. Did J52 and J53 run with THP set to [always]?
  3. Are the default lane nodes also part of the cluster?
  4. Do 1080Ti nodes typically or potentially run multiple jobs simultaneously?

@wtempel

  1. J52 and J53 were sent to different lanes and ran on different types of GPUs. The lane J53 was sent to contains all of the GPUs in the lane J52 was sent to, plus additional GPUs; J53 ran on one of those additional GPUs.
  2. J52 and J53 specifically were run with THP set to [always], but when we changed to [madvise], this did not change anything in the behavior of identical jobs.
  3. The default lane nodes are part of the cluster but are not run through the slurm queuing system. We actually just got rid of the default lane entirely to avoid confusion.
  4. Each 1080Ti GPU is assigned to only one job at a time.

It appears, however, that the issue is not with the 1080Ti GPUs themselves but with how one of our workers is set up. If we run the job on a 1080Ti GPU attached to a different worker, it runs fine. However, if the job is sent to one particular worker (which also hosts the master; this was the default lane and is also included in the 1080Ti lane), we get the hanging issue, and when we kill the job (or other CryoSPARC jobs on that worker) the worker goes into drain mode. I did see another thread on that issue specifically, and my cluster administrator is looking into it, so it might be an issue adjacent to, but not caused by, the extraction job.

I will update if we find anything helpful.

Update for now: the worker that is hanging has at least one GPU that is stuck (it shows no utilization but has some system processes stuck on it). It needs a reboot, but for various reasons it probably won’t be rebooted for a while. I will update if we have the same issue post-reboot.

Last update: once our worker and its GPU were rebooted, everything seems to be working fine. Thank you for your help troubleshooting!
