Select 2D - Job process terminated abnormally

Hello, I am trying to run a Select 2D job and hit this error (which doesn't really contain any information):

[CPU: 68.0 MB] Importing job module for job type select_2D…
[CPU: 186.3 MB] Job ready to run
[CPU: 186.3 MB] ***************************************************************
[CPU: 186.5 MB] Loaded info for 50 classes
[CPU: 69.4 MB] ====== Job process terminated abnormally.
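For reference, the full job log can be dumped with the cryosparcm CLI; this is a minimal sketch, and the project/job IDs below are placeholders for my actual ones:

# Dump the complete job.log for the failed job (P3/J42 are placeholders)
cryosparcm joblog P3 J42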

I then tried to stop, restart, and try again: same error.
I then tried to stop, update to v3.2.0, restart, and try again: same error.

Some info:
CentOS 7, cryoSPARC v3.2.0, NVIDIA driver 460.67, CUDA 11.2.

Can someone advise on what is happening?

I suspected that something might have gone wrong with the update, so I tried to force an update and received this error:

CryoSPARC current version v3.2.0
update starting on Thu Apr 1 12:15:39 PDT 2021

No version specified - updating to latest version.

=============================
Forcing update to version v3.2.0…

CryoSPARC is not already running.
If you would like to restart, use cryosparcm restart
Removing previous downloads…
Downloading master update…
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  785M  100  785M    0     0  5326k      0  0:02:30  0:02:30 --:--:--  4152k
Downloading worker update…
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 1807M  100 1807M    0     0  10.9M      0  0:02:45  0:02:45 --:--:--  16.7M
Done.

Update will now be applied to the master installation,
followed by worker installations on other nodes.

Deleting old files…
Extracting…
tar: Skipping to next header
tar: Exiting with failure status due to previous errors

Despite the tar errors, the updated installation still starts normally (as v3.2.0).
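In case the downloads themselves are corrupted, I plan to test the archives and retry the forced update; a rough sketch, and the tarball path is my guess at where cryosparcm puts the download:

# Test archive integrity without extracting (path is an assumption)
tar -tzf cryosparc_master.tar.gz > /dev/null && echo "archive OK"

# Re-download and re-apply the update
cryosparcm update --override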

Hi there,
I represent research computing users at a major university in California.

I want to chime in, as we are experiencing the same issue with the same software versions:
CentOS 7, cryoSPARC v3.2.0, NVIDIA driver 460.67, CUDA 11.2.

Here is the output from the main job window in the web browser:

[CPU: 6.10 GB]   Particles selected : 4798732
[CPU: 6.10 GB]   Particles excluded : 9319942
[CPU: 6.11 GB]   Done.
[CPU: 6.11 GB]   Interactive backend shutting down.
[CPU: 4.39 GB]   --------------------------------------------------------------
[CPU: 4.39 GB]   Compiling job outputs...
[CPU: 4.39 GB]   Passing through outputs for output group particles_selected from input group particles
[CPU: 6.60 GB]   This job outputted results ['blob', 'alignments2D']
[CPU: 6.60 GB]     Loaded output dset with 4798732 items
[CPU: 6.60 GB]   Passthrough results ['ctf', 'location', 'pick_stats']
[CPU: 11.86 GB]    Loaded passthrough dset with 14118674 items
[CPU: 10.75 GB]    Intersection of output and passthrough has 4798732 items
[CPU: 10.75 GB]  Passing through outputs for output group particles_excluded from input group particles
[CPU: 10.78 GB]  This job outputted results ['blob', 'alignments2D']
[CPU: 10.78 GB]    Loaded output dset with 9319942 items
[CPU: 10.78 GB]  Passthrough results ['ctf', 'location', 'pick_stats']
[CPU: 55.9 MB]   ====== Job process terminated abnormally.

However, the job.log file does not say anything useful:

================= CRYOSPARCW =======  2021-09-01 12:09:20.473610  =========
Project P24 Job J472
Master xxx.xxx.xxx Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 47428
========= monitor process now waiting for main process
MAIN PID 47428
select2D.run cryosparc_compute.jobs.jobregister
========= sending heartbeat
***************************************************************
INTERACTIVE JOB STARTED ===  2021-09-01 12:09:36.769559  ==========================
========= sending heartbeat
 * Serving Flask app "select_2D" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
========= sending heartbeat
========= main process now complete.
========= monitor process now complete.
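For completeness, I plan to also check the master-side service logs around the time the job died (standard cryosparcm log commands; exact service names may vary by version):

# Tail the core and web application logs on the master
cryosparcm log command_core
cryosparcm log app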

I see @stephan providing lots of help here; could you help us as well?

Hi @osinskit,

Thanks for reporting. It seems like you're processing a lot of particles. Could the master node be running out of memory and terminating the Python process? Try monitoring RAM usage while this job is completing; the join operation at the end of the job (which is where your job seems to have died) could be the culprit here.
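If you have shell access to the master node, something along these lines can confirm or rule out an out-of-memory kill; this is a sketch using standard Linux tools rather than anything cryoSPARC-specific:

# Watch memory usage while the job runs
watch -n 5 free -h

# After the job dies, check whether the kernel OOM killer fired
dmesg -T | grep -i -E 'out of memory|killed process'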

Thank you, @stephan. I really appreciate your help and advice.
In fact, I already monitored the memory usage roughly with "top", and the process used at most about 20 GB of RAM, which is well within the 256 GB available. After reaching 20.2 GB, the main process died.
Is there a way to make cryoSPARC produce more output, to pinpoint the exact moment of failure and make debugging easier?
It would be amazing if we could get cryoSPARC to handle this dataset, as it is one of many such datasets to come.
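For reference, this is roughly how we watched the process (top in batch mode against the main process PID reported in job.log; a sketch of our setup, not a cryoSPARC command):

# Log the job's main process (PID 47428 from job.log) every 5 seconds
top -b -d 5 -p 47428 | grep --line-buffered 47428 >> select2d_mem.log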