Set worker for CPU-only job

Hi,

We are starting to move our workstations to a master/worker configuration, with one workstation acting as the master and the rest as workers. This works very nicely, and we are able to send jobs to the desired worker by selecting the target GPU.

The only problem I have noticed is with CPU-only jobs: for example, extraction from micrographs can be faster on CPUs than on GPUs when many CPU cores are available. It would be nice if these CPU-only jobs could also be directed to a specific worker, for example the one where the micrographs are stored.

Thank you for considering this feature request.

Thanks for your feature suggestion @jcoleman.

A workaround in recent releases might be to reassign nodes with special CPU capabilities to a dedicated scheduler lane. The drawback is that such a reassignment removes the node from its current lane, and thus from that lane’s pool of resources.

Suppose the “CPU-heavy” worker cpuworker.local is already connected as a worker to the CryoSPARC master at csmaster.local, which runs on port 61000. One could then move the worker to a new, dedicated lane cpuworkers with the command (run on cpuworker.local):

/path/to/cryosparc_worker/bin/cryosparcw connect --update \
    --master csmaster.local --port 61000 \
    --worker cpuworker.local --newlane --lane cpuworkers

Please ensure that such a reassignment does not leave behind an empty lane (the lane of which the reassigned worker was previously a member).
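
For example (a sketch, run on the master; 'oldlane' is a hypothetical name for the lane the worker left), one could list the scheduler targets to confirm which lanes still have members, and then remove a lane that was left empty:

cryosparcm cli "get_scheduler_targets()"
cryosparcm cli "remove_scheduler_lane('oldlane')"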

Hi @wtempel, thanks for taking up the feature request! Related to this type of configuration, I am wondering whether CryoSPARC Live could expose the ability to assign Live work to specific workers, like we can when queuing a regular job? It’s not a big deal, but it may help ensure that the worker assigned to the Live job has local access to the data rather than having to read it over the network. Thank you!

@jcoleman To ensure a common understanding of your use case, please can you post a screenshot of an example job submission where you use the existing capability? Are you referring to the Run on specific GPU option?

@wtempel that’s right; what I am referring to is the ‘Run on specific GPU’ option. Let me know if you need a screenshot and I’ll post one when I get to the office.

Not needed, given the confirmation.

Given the caveat for Run on specific GPU, namely that overriding the scheduler may result in resource conflicts with other running jobs, would it be helpful and sufficient for your use case if one could select a specific worker host, rather than a specific GPU device on a specific worker host?

Yes definitely that would be even better!

In this case you could create a separate lane (which one can select as a Live lane or when queuing non-Live jobs) for each GPU computer. To help me propose suitable commands, please can you post the outputs of these commands, run on the CryoSPARC master computer:

cryosparcm status | grep -e HOSTNAME -e BASE_PORT
cryosparcm cli "get_scheduler_targets()"

Hi @wtempel, that’s not really what I would want to do, because the micrographs could sometimes be located on a different worker, for instance depending on which workstation has available space and compute, so that would reduce flexibility, unless I am misunderstanding. It would be very useful, though, if we could select a specific worker when we set up the job and have that choice passed to the scheduler.

Suppose there are CryoSPARC worker node “targets” csn1 and csn2 that are both currently part of the default lane. Wouldn’t moving csn1 to a new lane Lane1 and csn2 to a new lane Lane2 (and removing the default lane, if it is then empty) provide this kind of control?
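
For illustration (a sketch reusing the hypothetical master csmaster.local on port 61000 from the earlier example; adjust paths, hostnames and ports to your setup), the reassignments might look like this, run on csn1 and csn2 respectively:

/path/to/cryosparc_worker/bin/cryosparcw connect --update \
    --master csmaster.local --port 61000 \
    --worker csn1 --newlane --lane Lane1

/path/to/cryosparc_worker/bin/cryosparcw connect --update \
    --master csmaster.local --port 61000 \
    --worker csn2 --newlane --lane Lane2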

Hi @wtempel, ah yes, now I understand. Thank you, and sorry for being slow.

Here is the output from the commands that you suggested:

cryosparcm status | grep -e HOSTNAME -e BASE_PORT
export CRYOSPARC_MASTER_HOSTNAME="spector.structbio.pitt.edu"
export CRYOSPARC_BASE_PORT=39000

cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/scratch', 'cache_quota_mb': 3000000, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 2, 'mem': 11538923520, 'name': 'NVIDIA GeForce RTX 2080 Ti'}], 'hostname': 'spector.structbio.pitt.edu', 'lane': 'default', 'monitor_port': None, 'name': 'spector.structbio.pitt.edu', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], 'GPU': [0, 1, 2], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, 'ssh_str': 'cryosparcuser@spector.structbio.pitt.edu', 'title': 'Worker node spector.structbio.pitt.edu', 'type': 'node', 'worker_bin_path': '/data/opt/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11539054592, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11538923520, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 2, 'mem': 11707547648, 'name': 'NVIDIA GeForce GTX 1080 Ti'}], 'hostname': 'sitak.local', 'lane': 'default', 'monitor_port': None, 'name': 'sitak.local', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], 'GPU': [0, 1, 2], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, 'ssh_str': 'cryosparcuser@sitak.local', 'title': 'Worker node sitak.local', 'type': 'node', 'worker_bin_path': '/data/opt/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11707547648, 'name': 'NVIDIA GeForce GTX 1080 Ti'}, {'id': 1, 'mem': 11707547648, 'name': 'NVIDIA GeForce GTX 1080 Ti'}], 'hostname': 'lakota.local', 'lane': 'default', 'monitor_port': None, 'name': 'lakota.local', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55]}, 'ssh_str': 'cryosparcuser@lakota.local', 'title': 'Worker node lakota.local', 'type': 'node', 'worker_bin_path': '/shared/spector/data/opt/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/b24ssd1', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 8354398208, 'name': 'NVIDIA GeForce RTX 2060 SUPER'}, {'id': 1, 'mem': 8354398208, 'name': 'NVIDIA GeForce RTX 2060 SUPER'}, {'id': 2, 'mem': 8354398208, 'name': 'NVIDIA GeForce RTX 2060 SUPER'}], 'hostname': 'b24.local', 'lane': 'default', 'monitor_port': None, 'name': 'b24.local', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], 'GPU': [0, 1, 2], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 
35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]}, 'ssh_str': 'cryosparcuser@b24.local', 'title': 'Worker node b24.local', 'type': 'node', 'worker_bin_path': '/data/opt/cryosparc/cryosparc_worker/bin/cryosparcw'}]

The current configuration bundles all workers in a single scheduler lane called default.
To allow users to queue a job to, for example, b24.local specifically,

  1. log in to that worker
    ssh cryosparcuser@b24.local
    
  2. confirm that the command
    hostname -f
    
    prints
    b24.local
    
  3. and run
    /data/opt/cryosparc/cryosparc_worker/bin/cryosparcw connect \
        --master spector.structbio.pitt.edu --port 39000 \
        --worker $(hostname -f) --ssdpath /b24ssd1 \
        --newlane --lane $(hostname -s) --update
    

This command should move the b24.local worker node from the default lane to a lane called b24.
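
To verify the reassignment (optional), one could rerun on the master

cryosparcm cli "get_scheduler_targets()"

and confirm that the 'lane' field of the b24.local entry now reads b24.
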
You may follow these steps for each of the workers, keeping in mind:

  1. A job queued to a single-node lane would have to wait for a GPU or GPUs on that node, even if GPUs on other nodes are idle.
  2. If you choose to remove all workers from the default lane, then once all of them have been removed, remove the default lane with the command
    cryosparcm cli "remove_scheduler_lane('default')"
    
    Otherwise, jobs may still be queued to lane default, but would never run because the lane has no workers.

As an aside, I noticed that cryosparcw is in the /data/opt/cryosparc/cryosparc_worker/bin/ directory on all workers except on lakota.local, where cryosparcw is inside /shared/spector/data/opt/cryosparc/cryosparc_worker/bin. Have you confirmed that jobs run properly on lakota.local?
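
As one quick, optional check (a sketch; gpulist simply lists the GPUs that the worker installation can see), one might confirm that the installation at that path responds:

ssh cryosparcuser@lakota.local \
    /shared/spector/data/opt/cryosparc/cryosparc_worker/bin/cryosparcw gpulist

If that installation is functional, the command should print lakota.local's two GPUs.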

@wtempel great, thank you! We have been able to run jobs on lakota, but I do see your point, and I will look into making the path the same as on the other workers.