Jobs marked failed but did not fail

open

#1

Hi, we have set up a CryoSPARC v2 installation on our cluster. Most jobs run smoothly, but a couple of jobs are marked as failed early on even though they actually run to the end. For instance, I got the following at the end of a job log:

0.143 at 86.100 radwn. 0.5 at 57.265 radwn. Took 32.867s.
FSC Noise Sub… ========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= main process now complete.
========= monitor process now complete.

While this looks good, and the results of the job also look good, the job is marked as failed. The problem is that I can't really use the results in another job, because CryoSPARC thinks the job failed. Can I manually mark a job as finished to overcome this issue?

Best,

David


#2

Hi @david.haselbach,

Is it just one job type that fails without actually failing?

You can manually override the status of a job through the MongoDB shell.

cryosparcm mongo
> db.jobs.update({ project_uid: ‘P1’, uid: ‘J1’ }, {$set: { status: ‘completed’ }})

Replace P1 and J1 with the respective project UID and job UID. Please double check everything before running (and use the mongo shell only when really necessary)!
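If you would rather script this than type into the shell, the same update can be issued from Python. The following is only a sketch: pymongo, the port number, and the database name "meteor" are all assumptions here, so verify them against your own installation (e.g. via `cryosparcm status`) before running anything.

```python
# Sketch: the same status override via pymongo instead of the mongo shell.
# Assumptions (verify before use): pymongo is installed, the MongoDB port
# matches your installation, and the database is named "meteor".

def completed_update(project_uid, job_uid):
    """Build the filter/update pair that marks one job as completed."""
    flt = {"project_uid": project_uid, "uid": job_uid}
    upd = {"$set": {"status": "completed"}}
    return flt, upd

# from pymongo import MongoClient            # assumption: pymongo available
# client = MongoClient("localhost", 39001)   # port 39001 is an assumption
# flt, upd = completed_update("P1", "J1")
# client.meteor.jobs.update_one(flt, upd)
```

The pure helper keeps the filter and update documents in one place, so you can inspect them before touching the database.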

- Suhail


#3

This helps, but it is happening more and more often now, and I even see refinements stopping mid-run:

Here is an example output:
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
cryosparc2_compute/plotutil.py:237: RuntimeWarning: divide by zero encountered in log
logabs = n.log(n.abs(fM))
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
FSC No-Mask… ========= sending heartbeat
0.143 at 47.212 radwn. 0.5 at 29.354 radwn. Took 11.213s.
FSC Spherical Mask… ========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
0.143 at 55.576 radwn. 0.5 at 39.156 radwn. Took 15.641s.
FSC Loose Mask… ========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
0.143 at 64.936 radwn. 0.5 at 48.443 radwn. Took 69.148s.
FSC Tight Mask… ========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
0.143 at 74.440 radwn. 0.5 at 54.932 radwn. Took 49.658s.
FSC Noise Sub… ========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
cryosparc2_compute/sigproc.py:765: RuntimeWarning: invalid value encountered in divide
fsc_true = (fsc_t - fsc_n) / (1.0 - fsc_n)
0.143 at 74.251 radwn. 0.5 at 54.165 radwn. Took 94.160s.
========= sending heartbeat
[... the same heartbeat line repeated for ~60 lines ...]
FSC No-Mask… ========= sending heartbeat
0.143 at 87.402 radwn. 0.5 at 53.229 radwn. Took 10.785s.
FSC Spherical Mask… ========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
0.143 at 111.035 radwn. 0.5 at 67.943 radwn. Took 15.559s.
FSC Loose Mask… ========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= main process now complete.
========= monitor process now complete.
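As an aside, the two RuntimeWarnings in that log come from NumPy and are usually harmless in this context: `log(0)` simply yields `-inf`, and the noise-substitution FSC division yields NaN wherever `fsc_n == 1`. A minimal sketch reproducing both effects (my own illustration with made-up values, not CryoSPARC code):

```python
import numpy as np

# Reproduce the two warnings from the log (illustration only).
with np.errstate(divide="ignore", invalid="ignore"):
    fM = np.array([0.0, 1.0, np.e])
    logabs = np.log(np.abs(fM))                 # log(0) -> -inf ("divide by zero")

    fsc_t = np.array([0.9, 1.0])
    fsc_n = np.array([0.5, 1.0])
    fsc_true = (fsc_t - fsc_n) / (1.0 - fsc_n)  # 0/0 -> nan ("invalid value")
```

Neither case aborts the computation; NumPy just records the warning and carries the `-inf`/NaN values forward.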


#4

Can you show us the submission script for one of those jobs?
I got the same kind of error when I added srun (SLURM) to the cryoSPARC command in my scripts.


#5

Sure:

#SBATCH --job-name cryosparc_P14_J22
#SBATCH -n 4
#SBATCH --gres=gpu:1
#SBATCH -p gpu
#SBATCH --qos=medium
#SBATCH --mem=24000MB
#SBATCH -o /groups/haselbach/Susanne/So_022/180711_So22_again/cryosparc/P14/J22/P14_J22.out
#SBATCH -e /groups/haselbach/Susanne/So_022/180711_So22_again/cryosparc/P14/J22/P14_J22.err

available_devs=""
for devidx in $(seq 0 15)
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs
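The GPU-selection loop above is simple enough to sanity-check outside SLURM. Here is the same logic as a Python sketch; the assumed `nvidia-smi --query-compute-apps=pid --format=csv,noheader` behaviour (one PID per line, empty output when the device is idle) should be verified on your own nodes.

```python
def free_gpus(per_gpu_pids):
    """Return indices of idle GPUs, given the raw nvidia-smi compute-apps
    output per device (empty/whitespace output means no compute process)."""
    return [i for i, pids in enumerate(per_gpu_pids) if not pids.strip()]

# GPUs 0 and 2 have running processes; 1 and 3 are idle:
print(free_gpus(["1234\n", "", "9876\n", "\n"]))  # prints [1, 3]
```

This mirrors the bash loop: a device is considered available exactly when the query output is empty.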


#6

I am still having similar trouble, and I tried the MongoDB command:

db.jobs.update({ project_uid: ‘P52’, uid: ‘J20’ }, {$set: { status: ‘completed’ }})

But I am getting:
2019-02-09T22:18:15.040+0100 E QUERY [thread1] SyntaxError: illegal character @(shell):1:30

Any idea what is wrong?


#7

Hi @david.haselbach,

MongoDB is likely complaining about the typographic (curly) quote symbols that the forum substitutes for straight quotes; the mongo shell only accepts straight quotes. Try this:

db.jobs.update({ project_uid: 'P52', uid: 'J20' }, { $set: { status: 'completed' } })
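If you copy commands from the forum often, the curly quotes can also be stripped programmatically before pasting. A small sketch (my own helper, not part of cryoSPARC):

```python
def straighten_quotes(text):
    """Replace typographic (curly) quotes with plain ASCII quotes."""
    for curly, straight in (("\u2018", "'"), ("\u2019", "'"),
                            ("\u201c", '"'), ("\u201d", '"')):
        text = text.replace(curly, straight)
    return text

cmd = "db.jobs.update({ project_uid: \u2018P52\u2019 }, {$set: { status: \u2018completed\u2019 }})"
print(straighten_quotes(cmd))
```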

- Suhail