Marking 3D Class runs as complete not working (v4)

Hi,

I have 3D classification jobs (v4) with “output every F-EM cycle” set to true. For all of them so far, marking as complete after killing the job has not worked, even after restarting CryoSPARC. See the screenshot of the event log and the snippet from the joblog below.

Killing and marking as complete works fine for other job types, including 3DClass with “output every F-EM cycle” set to false.

It’s possible they are taking a really long time to complete I guess, but one of them has been “completing” for 12 hrs which seems a little on the long side. EDIT: the same job has not completed after 24hrs, so I think there is something wrong here.

Cheers
Oli

========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
/home/exx/cryosparc/cryosparc_worker/bin/cryosparcw: line 134: 86859 Terminated              python -c "import cryosparc_compute.run as run; run.run()" "$@"

Thank you for reporting this issue. Please can you download and email us the “job report” archive, which is available via the button at the tip of the green arrow.

Hi @wtempel,

I just tried that and it doesn’t seem to work - I just get the spinning arrows and it doesn’t download…

Checking further up the joblog, I do notice this error, maybe it is useful?

/home/exx/cryosparc/cryosparc_worker/bin/cryosparcw: line 134:  9985 Terminated              python -c "import cryosparc_compute.run as run; run.run()" "$@"
exception in force_free_cufft_plan: 'NoneType' object has no attribute 'handle'
exception in force_free_cufft_plan:
exception in cufft.Plan.__del__:
exception in cufft.Plan.__del__:
exception in cufft.Plan.__del__:
exception in force_free_cufft_plan: 'NoneType' object has no attribute 'handle'
exception in force_free_cufft_plan:
exception in cufft.Plan.__del__:
exception in cufft.Plan.__del__:
exception in cufft.Plan.__del__:
exception in force_free_cufft_plan: 'NoneType' object has no attribute 'handle'
exception in force_free_cufft_plan:
exception in cufft.Plan.__del__:
exception in cufft.Plan.__del__:
exception in cufft.Plan.__del__:
exception in force_free_cufft_plan: 'NoneType' object has no attribute 'handle'
exception in force_free_cufft_plan:
exception in cufft.Plan.__del__:
exception in cufft.Plan.__del__:
exception in cufft.Plan.__del__:
exception in force_free_cufft_plan: 'NoneType' object has no attribute 'handle'
exception in force_free_cufft_plan:
exception in cufft.Plan.__del__:
exception in cufft.Plan.__del__:
exception in cufft.Plan.__del__:
exception in force_free_cufft_plan: 'NoneType' object has no attribute 'handle'
exception in force_free_cufft_plan:
exception in cufft.Plan.__del__:
exception in cufft.Plan.__del__:
exception in cufft.Plan.__del__:
exception in force_free_cufft_plan: 'NoneType' object has no attribute 'handle'
exception in force_free_cufft_plan:
exception in cufft.Plan.__del__:
exception in cufft.Plan.__del__:
exception in cufft.Plan.__del__:
exception in force_free_cufft_plan: 'NoneType' object has no attribute 'handle'
exception in force_free_cufft_plan:
exception in cufft.Plan.__del__:
exception in cufft.Plan.__del__:
exception in cufft.Plan.__del__:
exception in force_free_cufft_plan: 'NoneType' object has no attribute 'handle'
exception in force_free_cufft_plan:========= sending heartbeat
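
For anyone scanning a joblog like the one above, where the same few lines repeat many times, a minimal Python sketch for collapsing repeats into counts may help. The function name `summarize_log` is made up for illustration and is not part of CryoSPARC; it assumes the log has already been read into a list of lines.

```python
from collections import Counter


def summarize_log(lines):
    """Count identical non-empty lines, keeping first-seen order.

    Hypothetical helper for illustration only; not a CryoSPARC function.
    """
    counts = Counter()
    for line in lines:
        stripped = line.strip()
        if stripped:  # skip blank lines
            counts[stripped] += 1
    # Counter is a dict subclass, so items() keeps insertion order (Python 3.7+)
    return list(counts.items())


if __name__ == "__main__":
    sample = [
        "exception in force_free_cufft_plan: 'NoneType' object has no attribute 'handle'",
        "exception in force_free_cufft_plan:",
        "exception in cufft.Plan.__del__:",
        "exception in cufft.Plan.__del__:",
        "exception in cufft.Plan.__del__:",
    ]
    for text, n in summarize_log(sample):
        print(f"{n:4d}x {text}")
```

To use it on a real file, read the lines with `open(path).read().splitlines()` and pass them in; the path to the joblog depends on your installation.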

EDIT: Downloading the job report doesn’t work in Safari, but it does in Chrome. Here you go:
[link deleted]

Thank you for sending the job report.
On Chrome, I get the spinning wheel (and will eventually be offered a pdf file for download) when I push the left section of the button.
Please can you confirm whether, on Safari, the spinning appears even when you push the right section of the button?

On Safari, I pushed the right side (dropdown -> download job report) and the job report does not download. I just confirmed that the same happens for other jobs too.

Update - it did eventually complete (~10hrs to complete for one job). One thing I noticed was that if I “mark as complete” while a job is already completing, the completion checks restart from scratch - for large classification jobs that take many hours to finalize, this can be a problem.

Bug noted (seems to be Safari specific – we’ll push out a fix in the next release).

“Mark as complete” runs the same validation process that is triggered at the end of a standard job, but runs it on the master node instead of the worker node. This is most likely why it takes significantly longer than the usual end-of-job validation (which I assume takes a few hours for 3D class with ~80 classes and ~1M particles?).

We’re now investigating ways in which we can make this validation process more efficient so the worker/master distinction won’t make such a drastic difference. Thanks for the feedback!

@olibclarke (and others): the latest patch we released this morning includes some caching that should significantly speed up this final job validation (on our machines a 100-class, 1.2M-particle 3D class job goes from 2.5h+ to 10-15min). This should affect normal job completion times and also the ‘mark as complete’ times. We would appreciate any feedback when you get the chance!
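
As a rough illustration of why caching can help with this kind of validation, here is a minimal Python sketch. It is not the code from the patch, which is not shown in this thread. It assumes, purely for illustration, that many particle rows point at the same few files, so an expensive per-file check only needs to run once per unique file. The names `validate_file`, `validate_outputs`, and `CALLS` are hypothetical.

```python
from functools import lru_cache

# Counts how often the expensive check actually runs (for demonstration only).
CALLS = {"count": 0}


@lru_cache(maxsize=None)
def validate_file(path: str) -> bool:
    """Stand-in for an expensive per-file check (hypothetical).

    lru_cache stores the result per path, so repeated paths are not re-checked.
    """
    CALLS["count"] += 1
    return path.endswith(".mrc")


def validate_outputs(paths) -> bool:
    """Return True if every referenced file passes the check."""
    return all(validate_file(p) for p in paths)


if __name__ == "__main__":
    # 2000 rows referencing only two distinct files
    rows = ["a.mrc"] * 1000 + ["b.mrc"] * 1000
    print(validate_outputs(rows), CALLS["count"])
```

With the cache in place, the cost scales with the number of distinct files rather than the number of particle rows, which is the general effect a cache would have under the stated assumption.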

I have a similar problem in CS v3.3.2: a job with 50 classes and 1.2 million particles completed (it technically looks complete, because I can download the maps, yet it is marked as failed). This does not allow me to use any of the outputs for downstream analysis. I manually marked the job as complete (this was 48 hours ago), but the job is still marked red and connected jobs are in the status “queued because inputs are not ready”. Unfortunately I don’t have command-line access. Is there anything else that can be done to rescue it?
Here are the last lines of the log.

Many thanks,
Matthias
[CPU: 15.70 GB] Outputting data…

[CPU: 27.19 GB] Zipping files…

[CPU: 27.19 GB] …done.

[CPU: 27.19 GB] Finished iteration 289 in 7383.507s. Total time so far 172071.240s

[CPU: 27.19 GB] ====== Done 3D Classification ======

[CPU: 27.19 GB] Full run took 172071.246s

[CPU: 16.03 GB] --------------------------------------------------------------

[CPU: 16.03 GB] Compiling job outputs…

[CPU: 16.03 GB] Passing through outputs for output group particles_all_classes from input group particles

[CPU: 2.21 GB] Finalizing Job…

@mvorlaender unfortunately, there’s nothing else you can do. When you press ‘mark as complete’, we perform the end-of-job validation on the master node. Prior to the latest patch, this can take quite a while if the master node is underpowered.

ok, thanks for the reply!