New jobs start when all GPUs are still busy

I have three standalone installations of CryoSPARC (with master and worker on the same workstation), running versions 4.2.1 and 4.1.0. In all cases, queuing does not work: the master launches every new job without waiting for GPU resources to become available, causing the jobs to crash. Is this a common issue?

Peter

What kind of workloads (CryoSPARC, non-CryoSPARC) are already running when a CryoSPARC job is being started on a GPU that is still busy?

No non-CryoSPARC jobs are running.
The master would simultaneously launch heavy 2D classification jobs with millions of particles on all GPUs.

Please can you post, for one standalone installation where this “over-booking” occurs, the outputs of

cryosparcm cli "get_scheduler_targets()"
cryosparcw gpulist
nvidia-smi

I tried on one of the machines with 2 GPUs. A big 2D job is running on both GPUs. I submit another big job on 2 GPUs. It launches immediately and fails before I have a chance to run the commands.
The error is:

Failed to launch! 255
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied, please try again.
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied, please try again.
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
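
The exit status 255 together with the ssh_askpass lines suggests the master is trying to ssh into a worker entry and cannot authenticate non-interactively. A quick check with standard OpenSSH options (the user@host strings below match the ssh_str values that appear later in this thread; substitute your own):

ssh -o BatchMode=yes cherep01@arnie true && echo "short name OK"
ssh -o BatchMode=yes cherep01@arnie.thecrick.org true && echo "FQDN OK"

BatchMode=yes disables password prompts, so each command only succeeds if key-based authentication works for that exact name.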

I tried it on another machine, with exactly the same result.

I checked “instance” info on one of the machines (Arnie), and it lists two targets:

  1. Arnie (with 2 GPUs)
  2. Arnie.domain.org (with 2 GPUs)

Basically, it seems the same worker node is duplicated under two names! Now I see that the first job is submitted to Arnie and the second to “Arnie.domain.org”, which means the scheduler works just fine.

Is it possible to edit the list of targets to remove duplicates?

I personally like having all workers use the FQDN, so I’d run:

cryosparcm cli "remove_scheduler_lane('Arnie')"

so that you only have one lane, with Arnie.domain.org as the only target.
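
For context, a duplicate target like this usually appears when cryosparcw connect has been run more than once with different --worker names, since each distinct name registers a separate target. A guess at how it happened here (the commands are illustrative, not taken from this installation’s actual history):

# hypothetical: the same box was connected twice under two names
cryosparcw connect --worker Arnie --master Arnie.domain.org --port 39000
cryosparcw connect --worker Arnie.domain.org --master Arnie.domain.org --port 39000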

This did not work:

cryosparcm cli "remove_scheduler_lane('arnie')"
None

Can you show the output of cryosparcm cli "get_scheduler_targets()" and a screenshot of the Instance tab in the web GUI?

cryosparcm cli "get_scheduler_targets()"

[{'cache_path': '/scratch/cherep01/cryosparc',
  'cache_quota_mb': None,
  'cache_reserve_mb': 10000,
  'desc': None,
  'gpus': [{'id': 0, 'mem': 11721506816, 'name': 'GeForce GTX 1080 Ti'},
           {'id': 1, 'mem': 11718230016, 'name': 'GeForce GTX 1080 Ti'}],
  'hostname': 'arnie.thecrick.org',
  'lane': 'default',
  'monitor_port': None,
  'name': 'arnie.thecrick.org',
  'resource_fixed': {'SSD': True},
  'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
                     'GPU': [0, 1],
                     'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]},
  'ssh_str': 'cherep01@arnie.thecrick.org',
  'title': 'Worker node arnie.thecrick.org',
  'type': 'node',
  'worker_bin_path': '/data/cherep01/cryosparc/cryosparc_worker/bin/cryosparcw'},
 {'cache_path': '/scratch/cherep01/cryosparc',
  'cache_quota_mb': None,
  'cache_reserve_mb': 10000,
  'desc': None,
  'gpus': [{'id': 0, 'mem': 11721506816, 'name': 'GeForce GTX 1080 Ti'},
           {'id': 1, 'mem': 11718230016, 'name': 'GeForce GTX 1080 Ti'}],
  'hostname': 'arnie',
  'lane': 'default',
  'monitor_port': None,
  'name': 'arnie',
  'resource_fixed': {'SSD': True},
  'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
                     'GPU': [0, 1],
                     'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]},
  'ssh_str': 'cherep01@arnie',
  'title': 'Worker node arnie',
  'type': 'node',
  'worker_bin_path': '/data/cherep01/cryosparc/cryosparc_worker/bin/cryosparcw'}]

(it is currently running one job on 2 GPUs)
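
To spot the duplication at a glance, the output can also be filtered for the hostname fields (assuming grep with -o support on the master):

cryosparcm cli "get_scheduler_targets()" | grep -o "'hostname': '[^']*'"
# 'hostname': 'arnie.thecrick.org'
# 'hostname': 'arnie'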

Once the job you have queued has completed, I’d just remove the default lane (and hence both targets) and then re-add the worker…

remove the default lane:

cryosparcm cli "remove_scheduler_lane('default')"

add the worker again:

/data/cherep01/cryosparc/cryosparc_worker/bin/cryosparcw connect \
    --worker arnie.thecrick.org \
    --master arnie.thecrick.org \
    --port 39000 \
    --ssdpath /scratch/cherep01/cryosparc \
    --newlane \
    --lane default \
    --ssdreserve 10000
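
Afterwards, it’s worth confirming that only one target remains:

cryosparcm cli "get_scheduler_targets()"   # should now list a single arnie.thecrick.org entry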

I"m not sure if this still works but you could try to remove just the target with:

cryosparcm cli 'remove_scheduler_target_node("'arnie'")'

or try this (I can’t remember if the target name needs to include the single quotes):

cryosparcm cli 'remove_scheduler_target_node("arnie")'

This will leave your existing default lane and should keep the FQDN worker.
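
Incidentally, the shell concatenates adjacent quoted pieces, so after parsing both variants hand the identical string to cryosparcm, which is easy to verify with echo:

echo 'remove_scheduler_target_node("'arnie'")'
echo 'remove_scheduler_target_node("arnie")'
# both print: remove_scheduler_target_node("arnie")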

For some reason my browser keeps formatting the single and double quotes oddly, so you might need to replace those with normal ones if the commands aren’t interpreted correctly.

One of them worked! I now have only a single worker, and queuing finally works.

THANK you!

The syntax that works is:

cryosparcm cli 'remove_scheduler_target_node("'arnie'")'

Thanks a lot for your help!
