Configuring workers to use a different master hostname

Is it possible to configure the worker to use a different master hostname than the one the master used during startup?

I am running a containerized master on a Kubernetes cluster. Due to our firewall rules, applications running inside the container cannot connect to ports exposed via the node (host). As a result, I have to use the pod (container) hostname or localhost when starting the services.

However, the worker is set up on HPC clusters managed by Slurm, and it can only communicate with the master through the node’s hostname. When I ran a test job, it failed; the command it tried to run was:

/cryosparc/worker/bin/cryosparcw run --project P1 --job J3 --master_hostname localhost --master_command_core_port 32002

I’m wondering if there is any way to configure either the master or the worker to use a different master hostname for communication.

Welcome to the forum @zqyou.

The --master_hostname parameter of the cryosparcw run command is usually inferred from the CRYOSPARC_MASTER_HOSTNAME variable, as assigned inside the cryosparc_master/config.sh file when CryoSPARC is started. Would it help if you were able to specify a different value for the --master_hostname parameter?
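For reference, the relevant lines in cryosparc_master/config.sh typically look like the excerpt below. The hostname and base port values are illustrative; the base port of 32000 is inferred from the command_core port 32002 in your command (command_core listens on the base port plus 2).

```shell
# cryosparc_master/config.sh (excerpt; values are illustrative)
export CRYOSPARC_MASTER_HOSTNAME="localhost"
export CRYOSPARC_BASE_PORT=32000   # command_core then listens on 32002
```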

Thank you for your reply.

Yes, the job works if I manually specify the node hostname using --master_hostname. However, I don’t want to do this manually every time. Is there a way to modify that value after CryoSPARC has started?

Thanks.

I’m looking at the script template variables, such as run_cmd and run_args. Can I modify these variables to append my custom hostname parameter?

I figured out that run_cmd is equal to

{{ worker_bin_path }} run {{ run_args }} > {{ project_dir_abs }}/job.log 2>&1

So, I appended my hostname parameter to run_args, and it worked. Was my understanding correct?

This seems plausible. Would you mind posting the corresponding script template?

Sure. This is my script template:

#!/bin/bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## {{ run_cmd }}            - the complete command string to run the job
## {{ num_cpu }}            - the number of CPUs needed
## {{ num_gpu }}            - the number of GPUs needed. 
##                            Note: the code will use this many GPUs starting from dev id 0
##                                  the cluster scheduler or this script have the responsibility
##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up
##                                  using the correct cluster-allocated GPUs.
## {{ ram_gb }}             - the amount of RAM needed in GB
## {{ job_dir_abs }}        - absolute path to the job directory
## {{ project_dir_abs }}    - absolute path to the project dir
## {{ job_log_path_abs }}   - absolute path to the log file for the job
## {{ worker_bin_path }}    - absolute path to the cryosparc worker command
## {{ run_args }}           - arguments to be passed to cryosparcw run
## {{ project_uid }}        - uid of the project
## {{ job_uid }}            - uid of the job
## {{ job_creator }}        - name of the user that created the job (may contain spaces)
## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:

#SBATCH --cluster=ascend
#SBATCH --account=PZS0722
#SBATCH --chdir {{project_dir_abs}}
#SBATCH --nodes=1
#SBATCH --ntasks-per-node={{ (ntasks_value|default(1)) }}  
#SBATCH --time={{ (time_value|default(60)) }}
#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --output={{ job_dir_abs }}/output.txt
#SBATCH --error={{ job_dir_abs }}/error.txt
#SBATCH --export=ALL,LD_PRELOAD=
{%- if num_gpu == 0 %}
{%- else %}
#SBATCH --gpus-per-node={{ num_gpu }}
{%- endif %}

export CRYOSPARC_SSD_PATH="${TMPDIR}"
# {{ run_cmd }}
{{ worker_bin_path }} run {{ run_args }} --master_hostname kubeworker23 > {{ project_dir_abs }}/job.log 2>&1

Is “hard-coding” the master hostname inside the script template compatible with your goal?

No, that is the script after updating it from Lane. During Pod startup, I insert the current node’s hostname into the Lane script. I just wanted to confirm if this is the right way to do it.
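A minimal sketch of that Pod-startup injection, assuming a placeholder token __MASTER_HOSTNAME__ in the stored template. The token and file names here are invented for illustration, not CryoSPARC-defined names.

```shell
# At Pod startup: substitute the current node's hostname into the
# cluster script template before (re)connecting the lane.
node_hostname="$(hostname)"   # or read from an env var / the downward API
sed "s/__MASTER_HOSTNAME__/${node_hostname}/g" \
    cluster_script_template.sh > cluster_script.sh
```

Inside a Kubernetes Pod, hostname usually returns the Pod name rather than the node name, so sourcing the value from the downward API or an injected environment variable is the safer route.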

I have another question regarding the SSD. When we installed the worker for the cluster, we did not specify --ssdpath because the SSD path will be specified later in a job using CRYOSPARC_SSD_PATH. However, when I run any job, I always get

  SSD   :  False

in the output. Did I miss something?

A future version of CryoSPARC may no longer use the cryosparcw run --master_hostname parameter. If Pod startup involves a CryoSPARC startup, you might, instead of relying on the --master_hostname parameter, try updating the CRYOSPARC_MASTER_HOSTNAME definition inside cryosparc_master/config.sh before CryoSPARC startup. Would that approach still work under the constraints described in your first post?

If cluster_info.json did not include a "cache_path" definition when you ran cryosparcm cluster connect, you may want to try including a “dummy” definition that will later be overridden by $CRYOSPARC_SSD_PATH. For example, you may include in cluster_info.json the line
"cache_path": "/tmp" (possibly followed by a comma, depending on the position of the line inside cluster_info.json) before connecting or updating the cluster lane.
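A sketch of what that could look like in cluster_info.json; the lane name and the Slurm command templates below are illustrative, not your actual configuration:

```json
{
  "name": "slurm-lane",
  "worker_bin_path": "/cryosparc/worker/bin/cryosparcw",
  "cache_path": "/tmp",
  "send_cmd_tmpl": "{{ command }}",
  "qsub_cmd_tmpl": "sbatch {{ script_path_abs }}",
  "qstat_cmd_tmpl": "squeue -j {{ cluster_job_id }}",
  "qdel_cmd_tmpl": "scancel {{ cluster_job_id }}",
  "qinfo_cmd_tmpl": "sinfo"
}
```

Here "cache_path": "/tmp" is the dummy value; at job time the export CRYOSPARC_SSD_PATH="${TMPDIR}" line in your script template takes precedence.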


No, I cannot use the hostname of the node that hosts the Pod during startup because the node’s firewall blocks traffic from Pods, causing several services to fail to start.

I was not referring to the node that hosts the container. Is the command cryosparcm start run within a container? If so, is a suitable (for the
--master_hostname parameter) hostname available inside that container?

Yes, we start the master inside a container.

Yes, the suitable value for --master_hostname parameter is available within the container.

Our worker is configured and runs outside the container. The suitable value in this case must be the hostname of the node that hosts the container.

In this case, could you set CRYOSPARC_MASTER_HOSTNAME to the hostname of the node that hosts the container inside the container’s cryosparc_master/config.sh?

I have tried that before, but it did not work because our node firewall does not allow traffic from a container. During CryoSPARC startup, it attempts to test the services, but they do not respond because the firewall blocks them. As a result, several CryoSPARC master services failed.

Do I understand correctly that

  1. kubeworker23, which you mentioned in Configuring workers to use a different master hostname - #7 by zqyou, is an example (bare-metal, not container) hostname on the cluster?
  2. you have currently defined in cryosparc_master/config.sh

    export CRYOSPARC_MASTER_HOSTNAME=localhost

    to work around the firewall restrictions?

If so, have you already tried configuring the container such that the container, but not other devices on the network, resolves the desired --master_hostname parameter to a loopback address, like 127.0.0.3? Would that allow CryoSPARC startup with a CRYOSPARC_MASTER_HOSTNAME definition that matches the desired --master_hostname parameter?
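On Kubernetes, one way to get that container-only name resolution is a hostAliases entry in the Pod spec. A sketch; the hostname kubeworker23 and the loopback address are illustrative:

```yaml
# Pod spec excerpt: the container resolves kubeworker23 to a loopback
# address via its /etc/hosts; other hosts on the network are unaffected.
spec:
  hostAliases:
    - ip: "127.0.0.3"
      hostnames:
        - "kubeworker23"
```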

Thanks. That sounds like a good idea. I will test it and let you know if it works.

That loopback hack works perfectly and also resolves another issue my teammate previously reported with ephemeral ports. Thank you for your help.