Select 2D - Job process terminated abnormally

cazumaya · April 1, 2021, 7:27pm

Hello, I am trying to run a Select2D job and hit this error (which doesn’t really have any information):

[CPU: 68.0 MB] Importing job module for job type select_2D…
[CPU: 186.3 MB] Job ready to run
[CPU: 186.3 MB] ***************************************************************
[CPU: 186.5 MB] Loaded info for 50 classes
[CPU: 69.4 MB] ====== Job process terminated abnormally.

I then tried to stop, restart, try again - same error.
I then tried to stop, update to v 3.2.0, restart, try again - same error.

Some info:
Centos 7, cryoSPARC 3.2.0, NVIDIA driver 460.67, CUDA version 11.2.

Can someone advise on what is happening?

cazumaya · April 1, 2021, 7:27pm

I found that maybe the something with the update could have gone wrong so I tried to force update and received this error:

CryoSPARC current version v3.2.0
update starting on Thu Apr 1 12:15:39 PDT 2021

No version specified - updating to latest version.

=============================
Forcing update to version v3.2.0…

CryoSPARC is not already running.
If you would like to restart, use cryosparcm restart
Removing previous downloads…
Downloading master update…
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:–:-- --:–:-- --:–:-- 0
100 785M 100 785M 0 0 5326k 0 0:02:30 0:02:30 --:–:-- 4152k
Downloading worker update…
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:–:-- --:–:-- --:–:-- 0
100 1807M 100 1807M 0 0 10.9M 0 0:02:45 0:02:45 --:–:-- 16.7M
Done.

Update will now be applied to the master installation,
followed by worker installations on other nodes.

Deleting old files…
Extracting…
tar: Skipping to next header
tar: Exiting with failure status due to previous errors

The updated version will still start normally (as v 3.2.0)

osinskit · September 1, 2021, 7:46pm

Hi there,
I am representing users from research computing in a major university in California.

I want to chime in as we are experiencing the same issue, with the same versions of software:
Centos 7, cryoSPARC 3.2.0, NVIDIA driver 460.67, CUDA version 11.2.

The output in the main job window in a web browser:

[CPU: 6.10 GB]   Particles selected : 4798732
[CPU: 6.10 GB]   Particles excluded : 9319942
[CPU: 6.11 GB]   Done.
[CPU: 6.11 GB]   Interactive backend shutting down.
[CPU: 4.39 GB]   --------------------------------------------------------------
[CPU: 4.39 GB]   Compiling job outputs...
[CPU: 4.39 GB]   Passing through outputs for output group particles_selected from input group particles
[CPU: 6.60 GB]   This job outputted results ['blob', 'alignments2D']
[CPU: 6.60 GB]     Loaded output dset with 4798732 items
[CPU: 6.60 GB]   Passthrough results ['ctf', 'location', 'pick_stats']
[CPU: 11.86 GB]    Loaded passthrough dset with 14118674 items
[CPU: 10.75 GB]    Intersection of output and passthrough has 4798732 items
[CPU: 10.75 GB]  Passing through outputs for output group particles_excluded from input group particles
[CPU: 10.78 GB]  This job outputted results ['blob', 'alignments2D']
[CPU: 10.78 GB]    Loaded output dset with 9319942 items
[CPU: 10.78 GB]  Passthrough results ['ctf', 'location', 'pick_stats']
[CPU: 55.9 MB]   ====== Job process terminated abnormally.

However job.log file does not say anything useful:

================= CRYOSPARCW =======  2021-09-01 12:09:20.473610  =========
Project P24 Job J472
Master xxx.xxx.xxx Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 47428
========= monitor process now waiting for main process
MAIN PID 47428
select2D.run cryosparc_compute.jobs.jobregister
========= sending heartbeat
***************************************************************
INTERACTIVE JOB STARTED ===  2021-09-01 12:09:36.769559  ==========================
========= sending heartbeat
 * Serving Flask app "select_2D" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
========= sending heartbeat
========= main process now complete.
========= monitor process now complete.

I see @stephan providing lots of help - can you help us?

stephan · September 1, 2021, 8:20pm

Hi @osinskit,

Thanks for reporting. It seems like you’re processing a lot of particles. Could the master node be running out of memory and terminating the python process? Try monitoring RAM usage while this job is completing- the join operation at the end of the job (which is where your job seems to have died) could be the culprit here.

osinskit · September 1, 2021, 11:12pm

Thank you, @stephan. I really appreciate your help and advice.
In fact, I already monitored the memory usage roughly with “top”, and the process used at most 20GB of RAM which is well within the 256GB available. After reaching 20.2GB, the main process died.
Is there a way to make cryosparc produce more output to pinpoint the exact moment and debug the issue easier?
It would be amazing if we made the Cryosparc tackle that dataset, as it is one of many such datasets to come.

iphan · March 30, 2022, 4:09am

Has this been solved? I am getting the same ‘Job process terminated abnormally’ for Inspect Particle Picks and can’t move forward. There’s no specific error.

AWS Linux 2 (Karoo), CryoSparc v3.3.1, NVIDIA 460.73.01, CUDA 11.3

Log:

[CPU: 3.00 GB] ==== Completed. Extracted 441808 particles.
[CPU: 3.00 GB] Interactive backend shutting down.
[CPU: 2.91 GB] --------------------------------------------------------------
[CPU: 2.91 GB] Compiling job outputs…
[CPU: 2.91 GB] Passing through outputs for output group micrographs from input group micrographs
[CPU: 2.91 GB] This job outputted results [‘micrograph_blob’]
[CPU: 2.91 GB] Loaded output dset with 4828 items
[CPU: 2.91 GB] Passthrough results [‘ctf’, ‘mscope_params’, ‘background_blob’, ‘movie_blob’, ‘ctf_stats’, ‘rigid_motion’, ‘spline_motion’, ‘micrograph_blob_non_dw’, ‘micrograph_thumbnail_blob_1x’, ‘micrograph_thumbnail_blob_2x’, ‘gain_ref_blob’]
[CPU: 2.92 GB] Loaded passthrough dset with 4828 items
[CPU: 2.92 GB] Intersection of output and passthrough has 4828 items
[CPU: 2.92 GB] Passing through outputs for output group particles from input group particles
[CPU: 3.00 GB] This job outputted results [‘location’]
[CPU: 3.00 GB] Loaded output dset with 441808 items
[CPU: 3.00 GB] Passthrough results [‘pick_stats’, ‘ctf’]
[CPU: 57.0 MB] ====== Job process terminated abnormally.

wtempel · March 30, 2022, 9:13pm

@iphan I do not know what caused the termination of this job. Does the output of
cryosparcm joblog <project_id> <job_id>
provide additional details?
Is it possible that the job was terminated for exceeding physical or administrative (such as ulimit, cgroups, hypervisor) memory limits?

iphan · March 30, 2022, 9:40pm

@wtempel thanks for looking into this, appreciated!

cryosparcm joblog shows nothing:
========= sending heartbeat
========= main process now complete.
========= monitor process now complete.

Logs show job usage < 3GB and total mem on server is 8GB.

$ ulimit -v
unlimited
$ systemctl show cryosparc.service | grep '^Memory'
**Memory**Current=18446744073709551615
**Memory**Accounting=no
**Memory**Limit=18446744073709551615

Reducing to 300K particles produces the same error. I picked that many before and had no problem.

200K particles: same error. Something is wrong with cryosparc.

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           7.4G        2.7G        4.6G        612K        207M        4.6G

iphan · March 30, 2022, 10:22pm

The job’s CPU always appears to show the same value regardless of how many particles I pick. Looking at my previous jobs, this does not feel right.

Failed jobs:
[CPU: 2.89 GB] ==== Completed. Extracted 229009 particles.
[CPU: 2.88 GB] ==== Completed. Extracted 308615 particles.
[CPU: 2.89 GB] ==== Completed. Extracted 558777 particles.

Contrast with old Inspect Particle Picks jobs that completed:
[CPU: 1.19 GB] ==== Completed. Extracted 187686 particles.
[CPU: 1.36 GB] ==== Completed. Extracted 310109 particles.

wtempel · March 31, 2022, 4:18pm

Please can you test the behavior after increasing DRAM to 16 GB+, rerun the earlier job

iphan:

[CPU: 3.00 GB] ==== Completed. Extracted 441808 particles.
[CPU: 3.00 GB] Interactive backend shutting down.
[CPU: 2.91 GB] --------------------------------------------------------------
[CPU: 2.91 GB] Compiling job outputs…
[CPU: 2.91 GB] Passing through outputs for output group micrographs from input group micrographs
[CPU: 2.91 GB] This job outputted results [‘micrograph_blob’]
[CPU: 2.91 GB] Loaded output dset with 4828 items
[CPU: 2.91 GB] Passthrough results [‘ctf’, ‘mscope_params’, ‘background_blob’, ‘movie_blob’, ‘ctf_stats’, ‘rigid_motion’, ‘spline_motion’, ‘micrograph_blob_non_dw’, ‘micrograph_thumbnail_blob_1x’, ‘micrograph_thumbnail_blob_2x’, ‘gain_ref_blob’]
[CPU: 2.92 GB] Loaded passthrough dset with 4828 items
[CPU: 2.92 GB] Intersection of output and passthrough has 4828 items
[CPU: 2.92 GB] Passing through outputs for output group particles from input group particles
[CPU: 3.00 GB] This job outputted results [‘location’]
[CPU: 3.00 GB] Loaded output dset with 441808 items
[CPU: 3.00 GB] Passthrough results [‘pick_stats’, ‘ctf’]
[CPU: 57.0 MB] ====== Job process terminated abnormally.

and post the output.

iphan · March 31, 2022, 6:20pm

My previous post shows that picking a smaller number of particles is now failing, i.e. 229K (failed job) vs 310K (previous successful job). Both Inspect Particle Picks jobs were run on the same AWS instance type: master_instance_type = c5.xlarge

Please can you explain why increasing DRAM would help? I’d like to understand the logic of doing this.

It is not straightforward (and slooow!) for me to do the backup, increase DRAM and redeploy my entire setup on AWS.

iphan · March 31, 2022, 11:00pm

Increasing DRAM to 16 GB did it. FYI I am testing on the same set of micrographs, extracting the same number of particles. I don’t understand why suddenly we need twice the memory.

Moving to larger RAM doubles the cost, so I would really appreciate it if you could explain why we need more RAM to do the same job?

Is there a way to optimize cryosparc to get back to where it needed less resources?

Select 2D - Job process terminated abnormally

============================= Forcing update to version v3.2.0…

=============================
Forcing update to version v3.2.0…