All new jobs get queued, don't run

istv01 · May 3, 2017, 3:44pm

We have a multi-user install on a RHEL 7 server. After a user ran a job yesterday, all his new jobs are queued and don’t run. I restarted cryosparc, installed the latest update, and rebooted the computer, but his jobs still won’t start. He killed the job and created some new ones; they go into the queue and just sit there. I would appreciate any suggestions; I can’t find anything on this problem.

apunjani · May 4, 2017, 3:44pm

Hi @istv01,

Is this only happening for one user? Can other users start their jobs?
On the dashboard page, does the stats section show any “active jobs”?

Can you paste the output of
cryosparc configure gpu list

Thanks,
Ali

istv01 · May 16, 2017, 4:50pm

Hi Ali,

This was the gpu config:

Detected 1 CUDA devices.

enabled  id  pci-bus       name

   x     0   0000:05:00.0  Quadro K2000

Tried a lot of different things, including reinstalling cryosparc - no luck, refinement jobs always get queued and don’t run. Ab initio runs ok.

Since this wasn’t working, I rebuilt the box with a Quadro P4000 card and reinstalled the OS (RHEL 7.3), CUDA 8 and cryosparc. Ab initio runs great, but refinement goes into the queue and just sits there. Nothing else is running on this box, and it should at least give a warning or error explaining why it doesn’t want to run.

apunjani · May 17, 2017, 4:21pm

Hi @istv01,

You may have gotten an email last night from my reply to this post - unfortunately our forum went down for a few hours last night and my reply was lost. The gist was:

CryoSPARC’s scheduler checks both GPU and RAM availability when launching jobs, and refinement jobs require the system to have at least 24GB RAM before they get launched. How much RAM does your system have?
If you want to override the RAM check, open the config.sh file in your cryosparc installation folder, and edit the line:

export CRYOSPARC_RAM_SLOTS="X"
Change the X to a number greater than 4, save the file, then run:
cryosparc stop && cryosparc start
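
A quick way to see whether a box clears the RAM threshold mentioned above is to read MemTotal from /proc/meminfo. The following is only a sketch: total_ram_gb and ram_check are hypothetical helper names, the 24 GB figure is the one quoted in this post, and how cryosparc's scheduler actually measures RAM is not stated in this thread.

```shell
#!/bin/sh
# Hypothetical helpers; not part of cryosparc.
REQUIRED_GB=24   # threshold quoted in this thread for refinement jobs

# Print total RAM in whole GB (rounded down) from a meminfo-style file.
# MemTotal is reported in kB.
total_ram_gb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Print "ok" if the given GB value meets the threshold, else "below".
ram_check() {
    if [ "$1" -ge "$REQUIRED_GB" ]; then
        echo "ok"
    else
        echo "below"
    fi
}

# Only runs on systems that expose /proc/meminfo (Linux).
if [ -r /proc/meminfo ]; then
    ram_gb=$(total_ram_gb)
    echo "Total RAM: ${ram_gb} GB ($(ram_check "$ram_gb") vs ${REQUIRED_GB} GB)"
fi
```

A machine sold as 12GB will typically report slightly less than that to the OS, so the rounded-down figure may read 11.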

Please let us know if that works.

Thanks,
Ali

istv01 · May 17, 2017, 4:26pm

Hi Ali,

Thanks for your email, got it last night.
The box has 12GB of RAM. If I change the RAM_SLOTS number in config.sh to 4, the refinement jobs start.
We’re in the process of getting more RAM.
The performance on the Quadro P4000 card is ~10X faster compared to the Quadro K2000 card, impressive.

For very large jobs, would the machine need a 2TB SSD, or would 1TB be enough for most things?

i-

apunjani · May 17, 2017, 4:31pm

Nice! We have just taken delivery of a couple of Quadro P100s and are excited to see performance on those.

The required SSD space depends on the box size of the particles; generally, a 1M-particle stack of large-ish particles will take up 1TB on its own.
In our dev servers we have either 1x or 2x 1.2TB SSDs.
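
As a rough check on the 1TB figure, the size of an uncompressed particle stack can be estimated as particles × box × box × bytes per pixel. The sketch below assumes 4 bytes per pixel (32-bit floats) and ignores file headers; stack_size_gb is a hypothetical helper, and the box sizes are illustrative rather than taken from this thread.

```shell
#!/bin/sh
# Hypothetical helper: estimate uncompressed particle stack size in GB
# (1 GB = 10^9 bytes), assuming 4 bytes per pixel and square boxes.
stack_size_gb() {
    particles="$1"
    box="$2"
    echo $(( particles * box * box * 4 / 1000000000 ))
}

stack_size_gb 1000000 512   # 1M particles, 512-pixel box
stack_size_gb 1000000 256   # 1M particles, 256-pixel box
```

Under these assumptions a 1M-particle stack at a 512-pixel box comes to about 1048 GB, in line with the 1TB estimate, while a 256-pixel box needs about a quarter of that.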

istv01 · May 17, 2017, 5:20pm

For Quadro cards, Nvidia says that single-precision performance is 2x the double-precision performance. Does this mean that in single-precision mode a Quadro card with X CUDA cores can perform as well as a GeForce card with 2X CUDA cores?

apunjani · May 17, 2017, 5:30pm

No, I think it is only a comparison between the single- and double-precision performance on the Quadro cards themselves. I believe GeForce and Quadro cards with the same chipset are about evenly matched for single-precision performance (all computation in cryoSPARC on GPUs is single precision).

clil16 · June 7, 2017, 4:59pm

Hi @apunjani

Would it be possible to add something in the queue or experiment page that informs the user why a job is not running?

istv01 · June 7, 2017, 5:16pm

Hi Ali,

I agree, it would be very useful to have more verbose error reporting. When a job fails there are no details of why. It would be very useful to know if the job needs more memory or disk space or what.

frostythebiochemist · August 23, 2017, 9:03pm

I am having the same problem as this user. New jobs go into the queue, accumulate there, but never start despite an idle machine. The queue doesn’t change even after killing the job, deleting the experiment or clearing the associated data.

I see that this issue is marked as solved, but I don’t see the solution in the forum.

I found this on the GitHub bug tracker and it worked for me after a stop and restart:

For now, to remove all jobs from the queue (but keep all the experiments etc):

From within the cryosparc installation directory:

eval $(cryosparc env)
source config.sh
mongo localhost:$CRYOSPARC_MONGO_PORT
Now in the database shell:

use meteor
db.jobs.update({status:"queued"}, {$set:{status:"killed"}})
exit
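
Before marking anything as killed, it can help to see how many jobs the database considers queued. The sketch below assumes the legacy mongo shell and the meteor database and jobs collection named in the commands above; count_queued is a hypothetical helper, not a cryosparc command.

```shell
#!/bin/sh
# Hypothetical helper: print the number of jobs whose status is "queued".
# Assumes the legacy `mongo` shell is on PATH and that the database is
# called `meteor` with a `jobs` collection, as in the commands above.
count_queued() {
    port="$1"
    mongo "localhost:${port}/meteor" --quiet \
        --eval 'print(db.jobs.count({status: "queued"}))'
}

# Usage, from the cryosparc installation directory after `source config.sh`:
#   count_queued "$CRYOSPARC_MONGO_PORT"
```

Running it before and after the update shows whether the queue was actually emptied. If the count is still above zero, note that in the legacy mongo shell db.collection.update() changes only the first matching document unless {multi: true} is passed as a third argument.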

istv01 · August 24, 2017, 8:01pm

Refinement jobs require 24GB RAM each by default. I had only 12GB, so I had to set the following in the config.sh file inside the cryosparc installation directory:
export CRYOSPARC_RAM_SLOTS="X"
with X=4. Then I restarted cryosparc and it worked.
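
For anyone who prefers to script the change, here is a minimal sketch that rewrites the CRYOSPARC_RAM_SLOTS line and keeps a backup of the original file. set_ram_slots is a hypothetical helper, not a cryosparc command, and it assumes config.sh already contains an export line for that variable; if it does not, the file is left unchanged.

```shell
#!/bin/sh
# Hypothetical helper: set CRYOSPARC_RAM_SLOTS in a config.sh file.
# Usage: set_ram_slots <path-to-config.sh> <slots>
set_ram_slots() {
    config="$1"
    slots="$2"
    # Keep a copy of the original next to it as config.sh.bak.
    cp "$config" "$config.bak" || return 1
    # Rewrite only the CRYOSPARC_RAM_SLOTS export; other lines pass through.
    sed "s/^export CRYOSPARC_RAM_SLOTS=.*/export CRYOSPARC_RAM_SLOTS=\"$slots\"/" \
        "$config.bak" > "$config"
}

# Example, from the cryosparc installation directory:
#   set_ram_slots config.sh 4 && cryosparc stop && cryosparc start
```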

frostythebiochemist · August 24, 2017, 10:37pm

Thanks istv01. I don’t think that was my issue; I had at least 128GB of RAM available. I am still not sure what led to the bug, but clearing the queue with db.jobs.update worked after a stop and restart.

apunjani · September 5, 2017, 11:40pm

Hi @frostythebiochemist and @istv01,

Thanks for pointing out the fix/link to the issue tracker.
The bug appears to happen when queued experiments are deleted without clearing them from the queue - we’re still working to sort it out.

Thanks,
Ali