New jobs start when all GPUs are still busy

I have three standalone installations of CryoSPARC (with master and worker on the same workstation), running versions 4.2.1 and 4.1.0. In all cases, queuing does not work: the master launches every new job without waiting for GPU resources to become available, causing the jobs to crash. Is this a common issue?

Peter

What kind of workloads (CryoSPARC, non-CryoSPARC) are already running when a CryoSPARC job is being started on a GPU that is still busy?

No non-CryoSPARC jobs are running.
The master would simultaneously launch heavy 2D classification jobs with millions of particles on all GPUs.

Please can you post, for one standalone installation where this “over-booking” occurs, the outputs of

cryosparcm cli "get_scheduler_targets()"
cryosparcw gpulist
nvidia-smi

I tried on one of the machines with 2 GPUs. A big 2D job is running on both GPUs. I submit another big job on 2 GPUs. It launches immediately and fails before I have a chance to run the commands.
The error is:

Failed to launch! 255
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied, please try again.
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied, please try again.
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
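
The exit status 255 together with the ssh_askpass lines suggests the master is trying to ssh into a worker entry and cannot authenticate non-interactively. A quick check with standard OpenSSH options (the user@host strings below match the ssh_str values that appear later in this thread; substitute your own):

ssh -o BatchMode=yes cherep01@arnie true && echo "short name OK"
ssh -o BatchMode=yes cherep01@arnie.thecrick.org true && echo "FQDN OK"

BatchMode=yes disables password prompts, so each command only succeeds if key-based authentication works for that exact name.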

I tried it on another machine, with exactly the same result.

I checked “instance” info on one of the machines (Arnie), and it lists two targets:

  1. Arnie (with 2 GPUs)
  2. Arnie.domain.org (with 2 GPUs)

Basically, it seems the same worker node is duplicated under two names! Now I see that the first job is submitted to Arnie and the second to “Arnie.domain.org”, which means the scheduler works just fine.

Is it possible to edit the list of targets to remove duplicates?

I personally like having all workers use the FQDN, so I’d run:

cryosparcm cli "remove_scheduler_lane('Arnie')"

so that you only have one lane, with Arnie.domain.org as the only target.
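
For context, a duplicate target like this usually appears when cryosparcw connect has been run more than once with different --worker names, since each distinct name registers a separate target. A guess at how it happened here (the commands are illustrative, not taken from this installation’s actual history):

# hypothetical: the same box was connected twice under two names
cryosparcw connect --worker Arnie --master Arnie.domain.org --port 39000
cryosparcw connect --worker Arnie.domain.org --master Arnie.domain.org --port 39000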

This did not work:

cryosparcm cli "remove_scheduler_lane('arnie')"
None

Can you show the output of cryosparcm cli "get_scheduler_targets()" and a screenshot of the Instance tab in the web GUI?

cryosparcm cli "get_scheduler_targets()"

[{'cache_path': '/scratch/cherep01/cryosparc',
  'cache_quota_mb': None,
  'cache_reserve_mb': 10000,
  'desc': None,
  'gpus': [{'id': 0, 'mem': 11721506816, 'name': 'GeForce GTX 1080 Ti'},
           {'id': 1, 'mem': 11718230016, 'name': 'GeForce GTX 1080 Ti'}],
  'hostname': 'arnie.thecrick.org',
  'lane': 'default',
  'monitor_port': None,
  'name': 'arnie.thecrick.org',
  'resource_fixed': {'SSD': True},
  'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
                     'GPU': [0, 1],
                     'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]},
  'ssh_str': 'cherep01@arnie.thecrick.org',
  'title': 'Worker node arnie.thecrick.org',
  'type': 'node',
  'worker_bin_path': '/data/cherep01/cryosparc/cryosparc_worker/bin/cryosparcw'},
 {'cache_path': '/scratch/cherep01/cryosparc',
  'cache_quota_mb': None,
  'cache_reserve_mb': 10000,
  'desc': None,
  'gpus': [{'id': 0, 'mem': 11721506816, 'name': 'GeForce GTX 1080 Ti'},
           {'id': 1, 'mem': 11718230016, 'name': 'GeForce GTX 1080 Ti'}],
  'hostname': 'arnie',
  'lane': 'default',
  'monitor_port': None,
  'name': 'arnie',
  'resource_fixed': {'SSD': True},
  'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
                     'GPU': [0, 1],
                     'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]},
  'ssh_str': 'cherep01@arnie',
  'title': 'Worker node arnie',
  'type': 'node',
  'worker_bin_path': '/data/cherep01/cryosparc/cryosparc_worker/bin/cryosparcw'}]

(it is currently running one job on 2 GPUs)
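
To spot the duplication at a glance, the output can also be filtered for the hostname fields (assuming grep with -o support on the master):

cryosparcm cli "get_scheduler_targets()" | grep -o "'hostname': '[^']*'"
# 'hostname': 'arnie.thecrick.org'
# 'hostname': 'arnie'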

Once the job you have queued has completed, I’d just remove the default lane (and hence both targets) and then re-add the worker…

remove the default lane:

cryosparcm cli "remove_scheduler_lane('default')"

add the worker again:

/data/cherep01/cryosparc/cryosparc_worker/bin/cryosparcw connect \
    --worker arnie.thecrick.org \
    --master arnie.thecrick.org \
    --port 39000 \
    --ssdpath /scratch/cherep01/cryosparc \
    --newlane \
    --lane default \
    --ssdreserve 10000
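
Afterwards, it’s worth confirming that only one target remains:

cryosparcm cli "get_scheduler_targets()"   # should now list a single arnie.thecrick.org entry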

I"m not sure if this still works but you could try to remove just the target with:

cryosparcm cli 'remove_scheduler_target_node("'arnie'")'

or try this (I can’t remember if the target name needs to include the single quotes):

cryosparcm cli 'remove_scheduler_target_node("arnie")'

This will leave your existing default lane and should keep the FQDN worker.
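
Incidentally, the shell concatenates adjacent quoted pieces, so after parsing both variants hand the identical string to cryosparcm, which is easy to verify with echo:

echo 'remove_scheduler_target_node("'arnie'")'
echo 'remove_scheduler_target_node("arnie")'
# both print: remove_scheduler_target_node("arnie")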

For some reason my browser keeps formatting the single and double quotes oddly, so you might need to replace those with normal ones if the commands aren’t interpreted correctly.

One of them worked! I now have only a single worker, and queuing finally works.

THANK you!

The syntax that works is:

cryosparcm cli 'remove_scheduler_target_node("'arnie'")'

Thanks a lot for your help!
