SLURM JWT integration - developer suggestions

Hi,

I’ve been trying to integrate a CryoSPARC instance running on a virtual machine with our Slurm-managed cluster.

After about a year of troubleshooting, I’ve found two pieces of code that I had to edit manually to get this working. Maybe you could implement a proper solution for them in a future release.

The first change lets the worker node communicate back to the master on the VM. Because our internal network does not create DNS entries for our VM hostnames, I’ve had to hardcode the internal IP address in place of the “MASTER_HOSTNAME” variable. I tried various “supported” fixes for this, but the hostname of the machine seemingly has to match the variable, and you can’t just override it with an IP address:

# Internal IP address of the master VM
IP=<your_IP>
FILE=~/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py
SEARCH="'--master_hostname', os.environ\['CRYOSPARC_MASTER_HOSTNAME'],"
REPLACE="'--master_hostname', '${IP}',"
# Swap the environment lookup for the hardcoded IP (edits the file in place)
sed -i "s/${SEARCH}/${REPLACE}/g" "$FILE"

This lets the sbatch script get exported with the hardcoded IP address instead of the hostname.
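For anyone who wants to check whether they are in the same situation, here is a minimal sketch of a resolution test that could be run on a worker node. The function name and the placeholder hostname are my own, not anything from CryoSPARC, and it only checks IPv4 name resolution, not whether the master's ports are reachable.

```python
import socket


def resolve_host(hostname):
    """Return the IPv4 address for hostname, or None if it does not resolve."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None


if __name__ == "__main__":
    # Placeholder name (an assumption); use the hostname the master reports.
    master = "cryosparc-master.example"
    print(master, "->", resolve_host(master))
```

If this prints None on the worker for the master's hostname, the worker has no way to find the master by name, which matches what I was seeing.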

The second thing I needed to change was in the “cluster.py” file that controls the sbatch commands. Because I’m using a SLURM JWT to submit jobs, I need to export the token, and although it’s set in my .bashrc file, for some reason the CryoSPARC webapp doesn’t see it. So I needed to export the token within the commands in the “cluster_info.json” file. This works fine for my “sbatch” submission command because that call passes the command string to a shell (shell=True) instead of going through “shlex.split(cmd)”:

Functional call in “submit_cluster_script” function:

res = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True).decode()

Nonfunctional call in “get_cluster_job_status”, “get_cluster_job_status_code”, “delete_cluster_job” functions:

res = subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT).decode()
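To illustrate the difference between the two calls, here is a small standalone sketch (not CryoSPARC code). The command string, the dummy token, and the helper names are made up for the demonstration, with "echo" standing in for the real Slurm command. It assumes a POSIX system where /bin/sh is available.

```python
import shlex
import subprocess

# Hypothetical stand-in for a cluster_info.json command: the export comes
# first, and "echo" takes the place of squeue/scancel.
CMD = 'export SLURM_JWT=dummy-token; echo "token is $SLURM_JWT"'


def run_with_shell(cmd):
    """Same call style as submit_cluster_script: the string goes to /bin/sh."""
    return subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True).decode()


def run_with_split(cmd):
    """Same call style as the status/delete functions: no shell is involved."""
    try:
        return subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT).decode()
    except (OSError, subprocess.CalledProcessError) as exc:
        # "export" is a shell builtin, so there is normally no such program
        # to execute, and the ";" is never treated as a command separator.
        return f"failed: {type(exc).__name__}"


print(shlex.split(CMD))              # first token is "export"
print(run_with_shell(CMD).strip())   # the shell handles export and ";"
print(run_with_split(CMD).strip())   # the export never takes effect
```

With shell=True the whole string is interpreted by a shell, so the export takes effect before the command runs. With shlex.split the string is only tokenized, and the first token ("export") is treated as the program to execute.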

When I change the nonfunctional calls to the functional one, it works.* I noticed that in previous CryoSPARC versions all of these used the functional call. I’m sure there’s a reason you changed it, but it might be a good idea to change back, or to create a “SLURM_JWT” option that allows this version to be used.
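As a rough idea of what such an option could look like while keeping the shlex.split style, the token could be passed through the child process environment instead of an export in the command string. This is only a sketch under my own assumptions: the function name and the extra_env parameter are hypothetical and not part of CryoSPARC, and "printenv" stands in for the real Slurm command.

```python
import os
import shlex
import subprocess


def run_cluster_command(cmd, extra_env=None):
    """Run cmd without a shell, adding extra_env to the child environment."""
    env = os.environ.copy()
    env.update(extra_env or {})
    return subprocess.check_output(
        shlex.split(cmd), stderr=subprocess.STDOUT, env=env
    ).decode()


# "printenv" stands in for squeue/scancel; the token value is a dummy.
print(run_cluster_command("printenv SLURM_JWT", {"SLURM_JWT": "dummy-token"}))
```

This way the command string itself would not need an export at all, so it would not matter whether the call goes through a shell.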

*For the changes to take effect, “cryosparcm restart” was needed.

Hope this helps!