2D classification abnormal termination

Hi,
I am getting different errors when running 2D classification in cryosparc v2.14.2. A snapshot is attached, along with a log file. I am not sure if the underlying cause is the same. The machine has 64 GB of RAM, and the box size is 384. Similar jobs with more particles ran normally earlier with the same parameters.

================= CRYOSPARCW =======  2020-03-09 17:25:20.172376  =========
Project P22 Job J114
Master cryosparc.host.utmb.edu Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 31292
========= monitor process now waiting for main process
MAIN PID 31292
class2D.run cryosparc2_compute.jobs.jobregister
/mnt/ape2/cryosparc/software/cryosparc/cryosparc2_worker-v2.14.2/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
***************************************************************
Running job  J114  of type  class_2D
Running job on hostname %s vds1-2.utmb.edu
Allocated Resources :  {u'lane': u'vds12', u'target': {u'monitor_port': None, u'lane': u'vds12', u'name': u'vds1-2.utmb.edu', u'title': u'Worker node vds1-2.utmb.edu', u'resource_slots': {u'GPU': [0], u'RAM': [0, 1, 2, 3, 4, 5, 6, 7], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, u'hostname': u'vds1-2.utmb.edu', u'worker_bin_path': u'/mnt/ape2/cryosparc/software/cryosparc/cryosparc2_worker-v2.14.2/bin/cryosparcw', u'cache_path': u'/mnt/scratch/cryosparc_cache', u'cache_quota_mb': None, u'resource_fixed': {u'SSD': True}, u'cache_reserve_mb': 10000, u'type': u'node', u'ssh_str': u'cryosparc@vds1-2.utmb.edu', u'desc': None}, u'license': True, u'hostname': u'vds1-2.utmb.edu', u'slots': {u'GPU': [0], u'RAM': [0, 1, 2], u'CPU': [0, 1]}, u'fixed': {u'SSD': True}, u'lane_type': u'vds12', u'licenses_acquired': 1}
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
/mnt/ape2/cryosparc/software/cryosparc/cryosparc2_worker-v2.14.2/deps/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py:59: RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)
/mnt/ape2/cryosparc/software/cryosparc/cryosparc2_worker-v2.14.2/deps/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py:70: RuntimeWarning: invalid value encountered in true_divide
  ret = ret.dtype.type(ret / rcount)
cryosparc2_compute/sigproc.py:771: RuntimeWarning: invalid value encountered in divide
  frc[k, :copylen] = (AB / n.sqrt(AA*BB))[:copylen]
cryosparc2_compute/sigproc.py:838: RuntimeWarning: invalid value encountered in greater
  crossings = n.where((fsc[:-1] > thresh) * (fsc[1:] < thresh))[0]
cryosparc2_compute/sigproc.py:838: RuntimeWarning: invalid value encountered in less
  crossings = n.where((fsc[:-1] > thresh) * (fsc[1:] < thresh))[0]
========= sending heartbeat
========= sending heartbeat
cryosparc2_compute/util/logsumexp.py:40: RuntimeWarning: divide by zero encountered in log
  return n.log(wa * n.exp(a - vmax) + wb * n.exp(b - vmax) ) + vmax
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
/mnt/ape2/cryosparc/software/cryosparc/cryosparc2_worker-v2.14.2/deps/anaconda/lib/python2.7/site-packages/matplotlib/pyplot.py:516: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
cryosparc2_compute/sigproc.py:771: RuntimeWarning: divide by zero encountered in divide
  frc[k, :copylen] = (AB / n.sqrt(AA*BB))[:copylen]
cryosparc2_compute/sigproc.py:846: RuntimeWarning: invalid value encountered in double_scalars
  x = (thresh - fa) * (b-a) / (fb - fa) + a
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= main process now complete.
========= monitor process now complete.

Jobs fail towards the end at one of the last iterations.
Thanks,
Michael

Hi @mbs, are you still having this problem? Were you able to run the job to completion since?

Hi @apunjani, yes, I am. Here is another example of a log file:

================= CRYOSPARCW =======  2020-04-01 15:16:44.739737  =========
Project P18 Job J87
Master cryosparc.host.utmb.edu Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 26840
========= monitor process now waiting for main process
MAIN PID 26840
class2D.run cryosparc2_compute.jobs.jobregister
/mnt/ape2/cryosparc/software/cryosparc/cryosparc2_worker-v2.14.2/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
.
.
.
/mnt/ape2/cryosparc/software/cryosparc/cryosparc2_worker-v2.14.2/deps/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py:59: RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)
/mnt/ape2/cryosparc/software/cryosparc/cryosparc2_worker-v2.14.2/deps/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py:70: RuntimeWarning: invalid value encountered in true_divide
  ret = ret.dtype.type(ret / rcount)
cryosparc2_compute/sigproc.py:771: RuntimeWarning: invalid value encountered in divide
  frc[k, :copylen] = (AB / n.sqrt(AA*BB))[:copylen]
cryosparc2_compute/sigproc.py:838: RuntimeWarning: invalid value encountered in greater
  crossings = n.where((fsc[:-1] > thresh) * (fsc[1:] < thresh))[0]
cryosparc2_compute/sigproc.py:838: RuntimeWarning: invalid value encountered in less
  crossings = n.where((fsc[:-1] > thresh) * (fsc[1:] < thresh))[0]
========= sending heartbeat
========= sending heartbeat
.
.
.
========= sending heartbeat
cryosparc2_compute/util/logsumexp.py:40: RuntimeWarning: divide by zero encountered in log
  return n.log(wa * n.exp(a - vmax) + wb * n.exp(b - vmax) ) + vmax
========= sending heartbeat
========= sending heartbeat
.
.
/mnt/ape2/cryosparc/software/cryosparc/cryosparc2_worker-v2.14.2/deps/anaconda/lib/python2.7/site-packages/matplotlib/pyplot.py:516: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)
========= sending heartbeat
========= sending heartbeat
.
.
.
========= sending heartbeat
========= sending heartbeat
/mnt/ape2/cryosparc/software/cryosparc/cryosparc2_worker-v2.14.2/deps/anaconda/lib/python2.7/site-packages/matplotlib/pyplot.py:516: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)
========= sending heartbeat
========= sending heartbeat
.
.
.
========= sending heartbeat
========= sending heartbeat
/mnt/ape2/cryosparc/software/cryosparc/cryosparc2_worker-v2.14.2/deps/anaconda/lib/python2.7/site-packages/matplotlib/pyplot.py:516: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
.
.
.
========= sending heartbeat
========= sending heartbeat
/mnt/ape2/cryosparc/software/cryosparc/cryosparc2_worker-v2.14.2/deps/anaconda/lib/python2.7/site-packages/matplotlib/pyplot.py:516: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
cryosparc2_compute/sigproc.py:771: RuntimeWarning: divide by zero encountered in divide
  frc[k, :copylen] = (AB / n.sqrt(AA*BB))[:copylen]
cryosparc2_compute/sigproc.py:846: RuntimeWarning: invalid value encountered in double_scalars
  x = (thresh - fa) * (b-a) / (fb - fa) + a
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
.
.
.
========= sending heartbeat
========= sending heartbeat
========= main process now complete.
========= monitor process now complete.

At that point, the job window shows: [CPU: 84.9 MB] ====== Job process terminated abnormally.
This time the error is reproducible; it happens at the 8th iteration.
Michael

Hi @mbs,

This is truly a bizarre case because there seem to be no job-termination errors in this log… there are some warnings, but those are common and do not affect the execution of the job.
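To make the "warnings only, no error" reading easy to check on other job logs, here is a minimal sketch that tallies the warnings and looks for a Python traceback. The function name `summarize_job_log` and the returned keys are my own, not part of cryoSPARC, and the patterns are based only on the log lines pasted in this thread.

```python
import re
from collections import Counter

# Matches Python warning lines such as
# "cryosparc2_compute/sigproc.py:771: RuntimeWarning: invalid value encountered in divide"
WARNING_RE = re.compile(
    r"^(?P<path>\S+\.py):(?P<line>\d+): (?P<kind>\w+Warning): (?P<msg>.*)$"
)


def summarize_job_log(text):
    """Count heartbeats, tally warnings, and flag tracebacks in a job log.

    Hypothetical helper: the line patterns are assumptions taken from the
    logs quoted in this thread, not from any cryoSPARC specification.
    """
    warnings = Counter()
    heartbeats = 0
    has_traceback = False
    completed = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith("sending heartbeat"):
            heartbeats += 1
        elif line.startswith("Traceback (most recent call last)"):
            has_traceback = True
        elif "main process now complete" in line:
            completed = True
        else:
            match = WARNING_RE.match(line)
            if match:
                warnings[(match.group("kind"), match.group("msg"))] += 1
    return {
        "heartbeats": heartbeats,
        "warnings": warnings,
        "has_traceback": has_traceback,
        "completed": completed,
    }
```

A log where `has_traceback` is false and `completed` is true, as in the ones above, suggests the main process did not exit through a Python exception, which fits it having been stopped from outside.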

Is this running on a cluster system? One possibility is that the cluster system is strictly enforcing memory and CPU limits and in later iterations the jobs are using slightly more memory than they say they will, and the cluster system then directly terminates the job process without warning.
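The same kind of silent termination can also happen on a standalone Linux machine if the kernel's out-of-memory killer stops the process. A rough way to check is to search the kernel log for the job's PID (the `MAIN PID` line in the job log). The sketch below assumes the kernel log text has been saved to a string, for example from `dmesg`, and that the kernel reports kills in the usual "Killed process <pid> (<name>)" form. The function names are hypothetical.

```python
import re

# Kernel OOM-killer records contain "Killed process <pid> (<name>)".
# Assumption: this wording is what the running kernel prints.
OOM_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log_text):
    """Return a list of (pid, process_name) pairs reported as killed."""
    return [(int(pid), name) for pid, name in OOM_RE.findall(kernel_log_text)]


def was_oom_killed(kernel_log_text, pid):
    """True if the given PID appears in an OOM-killer record."""
    return any(killed_pid == pid for killed_pid, _ in find_oom_kills(kernel_log_text))
```

For the first log in this thread, one would call `was_oom_killed(text, 31292)`. A negative result does not rule out a memory problem, since the kernel log may have rotated or a scheduler may have killed the job instead.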

If this is not the case, the only way we can debug is to try to reproduce the issue here. Would you be able to share the data?

Hi @apunjani,
No, that is a single workstation with one GPU and a SSD. Yes, I could share the data with you. It would be good to get to the bottom of it.
Michael

Hi,

Did this issue get resolved? I have the same error and am always working on a single workstation with an SSD, but 4 GPUs.

This is the error I repeatedly get:

[CPU: 99.5 MB] Project P1 Job J186 Started

[CPU: 99.5 MB] Master running v2.15.0, worker running v2.15.0

[CPU: 99.8 MB] Running on lane default

[CPU: 99.8 MB] Resources allocated:

[CPU: 99.8 MB] Worker: localhost

[CPU: 99.8 MB] CPU : [0, 1]

[CPU: 99.8 MB] GPU : [0, 1, 2]

[CPU: 99.8 MB] RAM : [0, 1, 2]

[CPU: 99.8 MB] SSD : False

[CPU: 99.8 MB] --------------------------------------------------------------

[CPU: 99.8 MB] Importing job module for job type class_2D…

[CPU: 512.3 MB] Job ready to run

[CPU: 512.3 MB] ***************************************************************

[CPU: 83.6 MB] ====== Job process terminated abnormally.

And if I check the joblog this is it, there is no clear error:

================= CRYOSPARCW ======= 2020-09-24 09:24:27.688578 =========
Project P1 Job J186
Master localhost Port 39002

========= monitor process now starting main process
MAINPROCESS PID 262080
========= monitor process now waiting for main process
MAIN PID 262080
class2D.run cryosparc2_compute.jobs.jobregister
/home/myfry/software/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= main process now complete.
========= monitor process now complete.

Thanks,
Michelle

Hi Michelle,

I have a workstation with 4 GPUs and I get the same error message all the time at random points of the jobs. Did you figure out a solution? I’d be very grateful if you could share.
Thanks,
Yashar

Hi Yashar,

If I remember correctly it was a memory problem. My labmate was running motion correction jobs at the same time, but on different GPUs.

Best,
Michelle
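
If other jobs on the same workstation are competing for RAM, it may help to look at how much memory is actually available before queuing a 2D classification. This is a minimal sketch, assuming a Linux system where `/proc/meminfo` is readable; the helper names are made up for illustration, and it says nothing about how much memory a given cryoSPARC job will need.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: kilobytes}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_gib(text):
    """Return the memory available to new processes, in GiB."""
    info = parse_meminfo(text)
    # MemAvailable is missing on very old kernels; fall back to MemFree.
    kilobytes = info.get("MemAvailable", info.get("MemFree", 0))
    return kilobytes / (1024.0 * 1024.0)
```

Typical use would be `available_gib(open("/proc/meminfo").read())`, run while the other jobs (motion correction in the case above) are active.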

Hi,
I am getting a similar error message while running 2D classification in cryosparc v3.2.0. Attached is the screenshot of the error file. The job has about 3.2 million particles. I don’t think the size of the particles could be the cause since jobs with more particles ran successfully earlier with the exact same parameters.


Can someone please help?
Thank you
Abhipsa

Hi Abhipsa,

Usually this means that the file is corrupt on disk. Try using the “check for corrupt particles” job - that should identify whether or not some of your particle stacks are unreadable.

–Harris

@abhipsa You noted that you’re running version 3.2. Did you consider updating cryoSPARC to the latest version, in addition to following @hsnyder’s suggestion?

Hi Harris,
Thanks for the help but unfortunately I do not find the above-mentioned job under utilities section (image attached). I guess that’s because I am still using v3.2.0. Is there any other way I can check for the presence of corrupt files?
-Abhipsa!
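
For anyone on a version without that job, one partial check is to compare each particle stack's size on disk with the size its own header implies. The sketch below assumes the stacks are MRC files with a standard 1024-byte, little-endian header (dimensions in the first three 32-bit words, data mode in the fourth, extended header length at byte offset 92). It is not a replacement for the cryoSPARC job and only catches truncated or padded files, not damaged pixel values. The function names are hypothetical.

```python
import os
import struct

# Bytes per voxel for the common MRC2014 data modes.
MODE_BYTES = {0: 1, 1: 2, 2: 4, 3: 4, 4: 8, 6: 2, 12: 2}
HEADER_BYTES = 1024


def expected_mrc_size(header):
    """File size implied by a 1024-byte MRC header, or None if invalid.

    Assumption: the file is little-endian.
    """
    nx, ny, nz, mode = struct.unpack_from("<4i", header, 0)
    (nsymbt,) = struct.unpack_from("<i", header, 92)  # extended header bytes
    if mode not in MODE_BYTES or min(nx, ny, nz) <= 0 or nsymbt < 0:
        return None
    return HEADER_BYTES + nsymbt + nx * ny * nz * MODE_BYTES[mode]


def check_mrc(path):
    """Return 'ok', 'bad header', or 'size mismatch' for one file."""
    with open(path, "rb") as handle:
        header = handle.read(HEADER_BYTES)
    if len(header) < HEADER_BYTES:
        return "bad header"
    expected = expected_mrc_size(header)
    if expected is None:
        return "bad header"
    return "ok" if os.path.getsize(path) == expected else "size mismatch"
```

Running `check_mrc` over every stack referenced by the particle set and listing anything that is not `'ok'` would at least narrow down which files to re-extract or re-copy.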

@wtempel Yes, we are still running the older version. We will update it soon.