Jobs queue up, won't run, GUI throws up 'hostname' dialog

BryanDH · February 23, 2020, 7:39pm

Hello!

We’ve run into a strange issue where any submitted job queues up but never runs. The GUI (screenshot attached) briefly shows a dialog that says ‘hostname’, and then the job stays in the queue.

cryosparcm log command_core shows this repeatedly:

****** Scheduler Failed ****
---------- Scheduler running ---------------
Jobs Queued: [(u'P10', u'J61')]
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 187, in background_worker
    scheduler_run_core() # sets last run time
  File "cryosparc2_command/command_core/__init__.py", line 1592, in scheduler_run_core
    alloc_hostname = j['resources_allocated']['hostname']
KeyError: 'hostname'
****** Scheduler Failed ****

A look at the database log shows no errors.

All of our worker nodes show up in the Instance Information tab in the Resource Manager.

I’ve tried reinstalling and forcing deps to reinstall. Still the same error. There was a similar post with this error, but no resolution was posted.

Any help would be greatly appreciated!
Bryan

nfrasser · February 26, 2020, 7:35pm

Hi @BryanDH, can you check the Resource Manager and see if there are any jobs queued? Clear those jobs out and try running a job again.

If that doesn’t work, try removing the scheduler lanes from the command line:

cryosparcm cli "remove_scheduler_lane('ugecluster')"

and re-adding it using the worker setup instructions: Installation - Standalone worker

BryanDH · February 26, 2020, 10:07pm

Hi:

Thanks for your response. No queued jobs, and I removed all workers and just have a default node lane. I added the workers back in but no luck.

I’ve been messing around with the __init__.py script located in <install path>/cryosparc2_master/cryosparc2_command/command_core

It seems the output from this code:

#find all running jobs
jobs_running  = mongo.db['jobs'].find({'status' : {'$in' : com.job_alive_statuses}}, {'resources_allocated' : 1})

is returning something unexpected for the next block of code:

# subtract currently running jobs from slots avail
for j in jobs_running:
    alloc_hostname = j['resources_allocated']['hostname']
    alloc_slots = j['resources_allocated']['slots']
    alloc_license = j['resources_allocated'].get('licenses_acquired', 0)
    # fixed dont get used up so ignore here
    if alloc_hostname in slots_avail:
        for slottype in alloc_slots.iterkeys():
            slots_avail[alloc_hostname][slottype] -= set(alloc_slots[slottype])
    licenses_used += alloc_license
# now slots_avail is currently correct

I added a print statement to dump out the contents of the “jobs_running” dictionaries when I queue up a job.

for x in jobs_running:
    print x.items()

With an entirely empty queue, and a single job submitted, the output is:

[(u'resources_allocated', {}), (u'_id', ObjectId('5e50664222de3ef32befe4e7'))]

so I’m not sure what direction to go in next.
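For anyone reading along, the failure can be reproduced outside CryoSPARC with plain dictionaries. The following is only a minimal sketch, not CryoSPARC's actual scheduler code: the function name subtract_running and the sample data are hypothetical, and it assumes the same document shape as the output above. It skips any job document whose 'resources_allocated' has no 'hostname' instead of raising KeyError, and reports which documents were skipped.

```python
def subtract_running(slots_avail, jobs_running):
    """Subtract slots held by running jobs; return ids of jobs with no allocation.

    Hypothetical helper for illustration, not part of CryoSPARC.
    """
    skipped = []
    for j in jobs_running:
        alloc = j.get('resources_allocated') or {}
        hostname = alloc.get('hostname')
        if hostname is None:
            # An empty allocation is what raised KeyError: 'hostname' above.
            skipped.append(j.get('_id'))
            continue
        if hostname in slots_avail:
            for slottype, used in alloc.get('slots', {}).items():
                slots_avail[hostname][slottype] -= set(used)
    return skipped


# Sample data (made up): one stuck job with an empty allocation, one healthy job.
slots_avail = {'node1': {'GPU': {0, 1}}}
jobs_running = [
    {'_id': 'stuck', 'resources_allocated': {}},
    {'_id': 'ok', 'resources_allocated': {'hostname': 'node1', 'slots': {'GPU': [0]}}},
]
skipped = subtract_running(slots_avail, jobs_running)
print(skipped, slots_avail)
```

The skipped list points at the documents worth inspecting in the database, which is the part the traceback alone does not reveal.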

I have reinstalled fresh and restored a backup of the database with the same results.

I’m running: v2.14.3-live_privatebeta, CentOS Linux release 7.7.1908

stephan · February 27, 2020, 7:36pm

Hey @BryanDH,

Try this:

Clear the job queue of any jobs
Open up an interactive python shell by running cryosparcm icli
Then, run:

jobs_running = list(db['jobs'].find({'status' : {'$in' : rc.com.job_alive_statuses}}, {'project_uid' : 1, 'uid' : 1}))
for job in jobs_running:
    cli.set_job_status(job['project_uid'], job['uid'], 'failed')
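The snippet above marks every job in an alive status as failed. A narrower selection is possible in principle; the sketch below is an untested assumption rather than an official procedure. The helper name jobs_missing_hostname is hypothetical, and it operates on plain dictionaries shaped like the job documents shown earlier in this thread, picking out only those whose 'resources_allocated' lacks a 'hostname'.

```python
def jobs_missing_hostname(job_docs):
    """Return (project_uid, uid) pairs for docs with no allocated hostname.

    Hypothetical helper for illustration, not part of CryoSPARC.
    """
    return [
        (d['project_uid'], d['uid'])
        for d in job_docs
        if 'hostname' not in (d.get('resources_allocated') or {})
    ]


# Made-up documents standing in for the result of the db['jobs'].find(...) query.
docs = [
    {'project_uid': 'P10', 'uid': 'J61', 'resources_allocated': {}},
    {'project_uid': 'P10', 'uid': 'J62',
     'resources_allocated': {'hostname': 'node1', 'slots': {'GPU': [0]}}},
]
targets = jobs_missing_hostname(docs)
print(targets)
```

To use this in the icli session, the query's projection would also need to include 'resources_allocated': 1, and each returned pair could then be passed to cli.set_job_status as in the loop above.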

BryanDH · February 27, 2020, 8:18pm

Ah! That fixed it! Thanks!! So maybe a job didn’t complete and got stuck in some limbo state?

stephan · February 27, 2020, 8:23pm

Hey @BryanDH,

It’s a bit confusing for us as well. We’re not sure why it happened, which is why I gave you the brute force method to resolve it. Something must’ve gone wrong during the scheduling process, but we’re not sure what. Please let us know if it happens again!

posertinlab · November 18, 2022, 4:24am

Hello — I know it’s been a while, but I had this error just now and the brute force solution resolved it for me too. I believe what may have caused this is we lost a ZFS mount unexpectedly. I rebooted the system after we got it back, but maybe something didn’t close out right…?

yoshiokc · December 13, 2022, 7:40pm

I’ve hit this twice, a few months ago and just today. Going to take a look and try to figure out which job was breaking the works. For now I just cleared them all from the GUI to make things allocate again.

yoshiokc · December 17, 2022, 5:18pm

We hit this again a few more times. I’m unsure whether the incidence rate has gone up after the update or whether a particular user’s interactions keep triggering it, but I narrowed it down to a single 2D streaming classification job being marked as deleted while also still running. This breaks the scheduler and also the display of active jobs in the GUI. Setting this deleted job’s status to ‘failed’ cleared things up.
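The inconsistent state described here (deleted, yet still in a running status) can be expressed as a simple filter. This is a rough sketch over plain dictionaries: the 'deleted' field name, the status values in ALIVE_STATUSES, and the helper name are all assumptions made for illustration and may not match CryoSPARC's actual schema.

```python
# Assumed status values; the real list lives in CryoSPARC's job_alive_statuses.
ALIVE_STATUSES = {'queued', 'launched', 'started', 'running', 'waiting'}


def find_deleted_but_alive(job_docs):
    """Return (project_uid, uid) pairs for docs flagged deleted yet still alive.

    Hypothetical helper; 'deleted' is an assumed field name.
    """
    return [
        (d['project_uid'], d['uid'])
        for d in job_docs
        if d.get('deleted') and d.get('status') in ALIVE_STATUSES
    ]


# Made-up documents for illustration.
jobs = [
    {'project_uid': 'P1', 'uid': 'J1', 'status': 'running', 'deleted': True},
    {'project_uid': 'P1', 'uid': 'J2', 'status': 'running', 'deleted': False},
    {'project_uid': 'P1', 'uid': 'J3', 'status': 'completed', 'deleted': True},
]
suspects = find_deleted_but_alive(jobs)
print(suspects)
```

Only the first document matches, since it is the only one that is both flagged deleted and in an alive status.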

wtempel · December 19, 2022, 7:47pm

Could you please email us the file produced by cryosparcm snaplogs, along with the project and job UIDs of the relevant classification job?

wtempel · December 22, 2022, 3:33pm

Thank you for sending the logs. We noticed an error that has been addressed in patch 221221 for CryoSPARC v4.1.1. Please see the guide for patch instructions. Does the streaming 2D classification job status issue persist after updating to CryoSPARC v4.1.1 (if needed) and applying the patch?

yoshiokc · January 6, 2023, 1:11am

I haven’t noticed the issue since applying the patch. Before the patch, though, I had noticed that another job that seemed to reliably trigger the issue of jobs not showing up in the GUI (while jobs kept queuing OK, AFAICT) was a ‘Curate Exposures’ job.

wtempel · January 6, 2023, 1:39pm

Thanks for the feedback. If you again encounter this problem after patching, please post details. As I will now close this topic, please open a new one if necessary.