We have a multi-user install on a RHEL 7 server. After a user ran a job yesterday, all his new jobs are queued and don’t run. I restarted cryosparc, installed the latest update, and rebooted the computer, but his jobs still will not start. He killed the job and created some new ones; they go into the queue and just sit there. I would appreciate any suggestions; I can’t find anything on this problem.
Is this only happening for one user? Can other users start their jobs?
On the dashboard page, does the stats section show any “active jobs”?
Can you paste the output of
cryosparc configure gpu list
This was the gpu config:
Detected 1 CUDA devices.
enabled id pci-bus name
x 0 0000:05:00.0 Quadro K2000
I tried a lot of different things, including reinstalling cryosparc, with no luck: refinement jobs are always queued and never run. Ab initio runs OK.
Since this wasn’t working, I rebuilt the box with a Quadro P4000 card and reinstalled the OS (RHEL 7.3), CUDA 8 and cryosparc. Ab initio runs great, but refinement goes into the queue and just sits there. Nothing else is running on this box, and it should at least give a warning or error explaining why the job won’t run.
You may have gotten an email last night from my reply to this post - unfortunately our forum went down for a few hours last night and my reply was lost. The gist was:
CryoSPARC’s scheduler checks both GPU and RAM availability when launching jobs, and refinement jobs require the system to have at least 24GB RAM before they get launched. How much RAM does your system have?
If you want to override the RAM check, open the config.sh file in your cryosparc installation folder and edit the line:
Change the X to a number greater than 4, save the file, then run
cryosparc stop && cryosparc start
Please let us know if that works.
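To answer the "how much RAM" question quickly on a Linux box, one could read the MemTotal line from /proc/meminfo and compare it with the 24GB figure quoted above. The following is only a minimal sketch: the function names are made up for illustration, and whether cryoSPARC's scheduler compares total or free memory, or GB versus GiB, is an assumption not confirmed in this thread.

```python
import re

REQUIRED_GB = 24  # refinement requirement quoted in this thread


def mem_total_gb(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, converted from kB to GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1)) / (1024 ** 2)


def has_enough_ram(meminfo_text, required_gb=REQUIRED_GB):
    """True if the reported total RAM meets the (assumed) job requirement."""
    return mem_total_gb(meminfo_text) >= required_gb


# Usage on a real machine (Linux only):
#   with open("/proc/meminfo") as fh:
#       print(has_enough_ram(fh.read()))
```

On a 12GB machine like the one in this thread, the check would come back False, which matches the refinement jobs staying queued.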
Thanks for your email, got it last night.
The box has 12GB of RAM. If I change the RAM_SLOTS number in config.sh to 4, the refinement jobs start.
We’re in the process of getting more RAM.
The performance on the Quadro P4000 card is ~10X faster compared to the Quadro K2000 card, impressive.
For very large jobs, would the machine need a 2TB SSD, or would 1TB be enough for most things?
Nice! We have just taken delivery of a couple of Quadro P100s and are excited to see performance on those.
The required SSD space depends on the box size of the particles - generally a 1M particle stack of large-ish particles will take up 1TB on its own.
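As a rough check on that 1TB figure, the stack size can be estimated from particle count and box size. This is a back-of-the-envelope sketch that assumes uncompressed square particle images stored as 32-bit floats (4 bytes per pixel); the function names and the 512-pixel box are illustrative choices, not values from this thread.

```python
def stack_size_bytes(n_particles, box_px, bytes_per_pixel=4):
    """Approximate on-disk size of a stack of square particle images."""
    return n_particles * box_px * box_px * bytes_per_pixel


def stack_size_tb(n_particles, box_px, bytes_per_pixel=4):
    """Same estimate in decimal terabytes (1 TB = 1e12 bytes)."""
    return stack_size_bytes(n_particles, box_px, bytes_per_pixel) / 1e12


# 1M particles at an assumed 512 px box: about 1.05 TB
print(round(stack_size_tb(1_000_000, 512), 2))
# The same count at a 256 px box: about 0.26 TB
print(round(stack_size_tb(1_000_000, 256), 2))
```

Under these assumptions, halving the box size cuts the required space by a factor of four, so whether 1TB is enough depends mostly on how large the particles are.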
In our dev servers we have either 1x or 2x 1.2TB SSDs.
For Quadro cards, Nvidia says that single-precision performance is 2x the double-precision performance. Does this mean that in single-precision mode a Quadro card with X CUDA cores can perform like a GeForce card with 2X CUDA cores?
No, I think it is only a comparison between the single- and double-precision performance on the Quadro cards themselves. I believe GeForce and Quadro cards with the same chipset are about evenly matched for single-precision performance (all GPU computation in cryoSPARC is single precision).
Would it be possible to add something in the queue or experiment page that informs the user why a job is not running?
I agree, it would be very useful to have more verbose error reporting. When a job fails there are no details of why. It would be very useful to know if the job needs more memory or disk space or what.
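To make the request concrete, here is a hypothetical sketch of what such reporting could look like: a check that returns human-readable reasons instead of silently leaving a job queued. It is not cryoSPARC's scheduler code; the function name, parameters and message wording are all invented for illustration, based only on the GPU and RAM checks described earlier in this thread.

```python
def blocking_reasons(need_gpus, need_ram_gb, free_gpus, free_ram_gb):
    """Return a list of reasons a job cannot start; an empty list means it can run."""
    reasons = []
    if need_gpus > free_gpus:
        reasons.append(f"needs {need_gpus} GPU(s), {free_gpus} free")
    if need_ram_gb > free_ram_gb:
        reasons.append(f"needs {need_ram_gb:g} GB RAM, {free_ram_gb:g} GB available")
    return reasons


# A refinement job (24 GB, per this thread) on a 12 GB box with one idle GPU
for reason in blocking_reasons(need_gpus=1, need_ram_gb=24, free_gpus=1, free_ram_gb=12):
    print("queued:", reason)
```

Showing a list like this next to each queued job would have pointed straight at the RAM requirement in the original report.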
I am having the same problem as this user. New jobs go into the queue, accumulate there, but never start despite an idle machine. The queue doesn’t change even after killing the job, deleting the experiment or clearing the associated data.
I see that this issue is marked as solved, but I don’t see the solution in the forum.
I found this on the GitHub bug tracker and it worked for me after a stop and restart:
For now, to remove all jobs from the queue (but keep all the experiments etc):
From within the cryosparc installation directory:
eval $(cryosparc env)
Now in the database shell:
Refinement jobs require 24GB RAM each by default. I had only 12GB, so I had to set the following in the config.sh file inside the cryosparc installation directory:
with X=4. Then I restarted cryosparc and it worked.
Thanks istv01. I don’t think that was my issue, I had at least 128GB of RAM available. I am still not sure what led to the bug, but clearing db.jobs.update worked after a stop and restart.
Hi @frostythebiochemist and @istv01,
Thanks for pointing out the fix/link to the issue tracker.
The bug appears to happen when queued experiments are deleted without clearing them from the queue - we’re still working to sort it out.