Particle extraction hangs after last micrograph

Hi,

We have been having a problem where the particle extraction step hangs intermittently (sometimes this completes without errors) after cryosparc reaches the final micrograph This is happening for both versions 2.15.1 and 2.16.1 (installed on Ubuntu 16.04 on the master node and Centos 7 on the worker nodes). We have recently configured our system to have 3 worker nodes connecting to a master node in a non-cluster setup. All other jobs work in this configuration, except for this error with particle extraction.
I have checked the extract directory in the job directory and I can see particle stacks present for the micrographs.

Here is the output for the extraction step.

[CPU: 3.56 GB]   (5776 of 5784) Finished processing micrograph 5778.
[CPU: 3.56 GB]   (5777 of 5784) Finished processing micrograph 5777.
[CPU: 3.56 GB]   (5778 of 5784) Finished processing micrograph 5781.
[CPU: 3.56 GB]   (5779 of 5784) Finished processing micrograph 5779.
[CPU: 3.56 GB]   (5780 of 5784) Finished processing micrograph 5780.
[CPU: 3.56 GB]   (5781 of 5784) Finished processing micrograph 5783.
[CPU: 3.56 GB]   (5782 of 5784) Finished processing micrograph 5782.
Here is the cryosparc job log:
================= CRYOSPARCW =======  2020-06-24 18:04:37.611009  =========
Project P36 Job J193
Master eagle Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 30144
========= monitor process now waiting for main process
MAIN PID 30144
extract.run cryosparc2_compute.jobs.jobregister
========= sending heartbeat
========= sending heartbeat
***************************************************************
/app/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
/app/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
/app/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
/app/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
========= sending heartbeat
cryosparc2_compute/micrographs.py:405: RuntimeWarning: divide by zero encountered in divide
  return trim_mic(arr_zp_lp, out_shape) / trim_mic(ones_zp_lp, out_shape)
cryosparc2_compute/micrographs.py:405: RuntimeWarning: invalid value encountered in divide
  return trim_mic(arr_zp_lp, out_shape) / trim_mic(ones_zp_lp, out_shape)
cryosparc2_compute/micrographs.py:405: RuntimeWarning: divide by zero encountered in divide
  return trim_mic(arr_zp_lp, out_shape) / trim_mic(ones_zp_lp, out_shape)
cryosparc2_compute/micrographs.py:405: RuntimeWarning: invalid value encountered in divide
  return trim_mic(arr_zp_lp, out_shape) / trim_mic(ones_zp_lp, out_shape)
cryosparc2_compute/micrographs.py:405: RuntimeWarning: divide by zero encountered in divide
  return trim_mic(arr_zp_lp, out_shape) / trim_mic(ones_zp_lp, out_shape)
cryosparc2_compute/micrographs.py:405: RuntimeWarning: invalid value encountered in divide
  return trim_mic(arr_zp_lp, out_shape) / trim_mic(ones_zp_lp, out_shape)
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
/app/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/matplotlib/pyplot.py:522: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat

Let me know if you have any suggestions on ways to fix this problem.

Thanks

Hi @couch,

Is it possible if you can run this job in CPU mode (set Number of GPUs to parallelize (0 for CPU-only) parameter to 0) or with only 1 GPU, and let us know if this type of error still happens? We’re investigating this issue internally.

Hi Stephen,

Thanks for the response. I will try running the extraction job with 1 GPU and a separate job as 0 (CPU) and let you know if we still get this error.

We ran a few different jobs: multi-GPU, single GPU and CPU. Both the single GPU and CPU extraction jobs completed successfully, while the multi-GPU extract occasionally stalled after extraction from the last micrograph. For now, it seems like 1-GPU extraction is working so far for our tests.