Setting up cryoSPARC to send jobs to multiple nodes

I have cryoSPARC installed and set up to run on the GPU node of our cluster. But there are other nodes available, and I want to be able to select the node a job is queued to. Since not all jobs require a GPU, it would be more efficient to send each job to a suitable partition. How do I set this up?

Hi @73km,

I’d like to understand your issue in a bit more detail. Have you currently set up cryoSPARC on one cluster and you’re looking to split the cluster into specific nodes, or do you already have your nodes as separate workers?

Addy.

So in our cluster setup there are multiple partitions/nodes: over a hundred CPU nodes and only 2 GPU nodes. While setting up cryoSPARC, I set partition=gpu [for the GPU node] because that’s where the CUDA drivers are and cryoSPARC can make use of the GPUs. But I realize not all jobs require GPUs, so now I want to be able to select which partition to send a job to based on its requirements. In tutorial videos, I see an option to select different lanes while queuing jobs. It would be nice if I could make those lanes correspond to the different partitions we have.

Hi @73km,
Do you use Slurm on your cluster?
If you have a lane definition for the GPU partition, just copy it and modify:

  • The cluster_info.json file so that “name” is different (that will be the name of your new lane)
  • The default cluster_script.sh so that the parameter
    #SBATCH --partition=your_gpu_slurm_partition
    points to your new partition (your_cpu_partition); see the sketch below.
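
For example, a minimal cluster_info.json for the new CPU lane might look like the sketch below (the lane name, partition, and paths are placeholders; keep whatever your existing GPU lane already uses and change only “name”):

  {
      "name": "slurm_cpu",
      "worker_bin_path": "/path/to/cryosparc_worker/bin/cryosparcw",
      "cache_path": "/path/to/ssd/cache",
      "send_cmd_tpl": "{{ command }}",
      "qsub_cmd_tpl": "sbatch {{ script_path_abs }}",
      "qstat_cmd_tpl": "squeue -j {{ cluster_job_id }}",
      "qdel_cmd_tpl": "scancel {{ cluster_job_id }}",
      "qinfo_cmd_tpl": "sinfo"
  }

while the copied cluster_script.sh only needs its partition line changed:

  #SBATCH --partition=your_cpu_partition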

Then run cryosparcm cluster connect and your new lane should be visible in the web interface.
You will still need to choose the lane when you launch the job…
Best,
Juan


Thanks! It worked! I wish I could keep multiple of these cluster scripts so that I can manage them all at once. It seems I can only have one set of files at a time.

Hi @73km,

If I’m understanding your post correctly, you’d like to have a copy of each set of cluster integration scripts.
You can do the following:
Create a folder called cryosparc_cluster_scripts adjacent to cryosparc_master and cryosparc_worker. Inside this folder, create a new folder for each partition (you can run cryosparcm cluster dump <cluster name> inside each folder to get cryoSPARC to write out the two files for that cluster integration):

- cryosparc_cluster_scripts
  - partition1
    - cluster_info.json
    - cluster_script.sh
  - partition2
    - cluster_info.json
    - cluster_script.sh
  - partition3
    - cluster_info.json
    - cluster_script.sh
 

When you run cryosparcm cluster connect, it reads the two required files from your current working directory. Whenever you want to make modifications to any of the scripts, navigate to the folder of the partition you want to modify, edit the files, then run cryosparcm cluster connect again.
This way you don’t have to keep modifying a single set of files in order to update each configuration.
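
For example (the folder and lane names here are just placeholders), the round trip for one partition might look like:

  # one-time: dump the existing files for a lane into its own folder
  mkdir -p cryosparc_cluster_scripts/partition2
  cd cryosparc_cluster_scripts/partition2
  cryosparcm cluster dump partition2     # writes cluster_info.json and cluster_script.sh here

  # later: edit the two files in this folder, then re-register the lane
  cryosparcm cluster connect             # reads both files from the current directory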

Hey Stephan,

I have installed cryoSPARC master on one of our login nodes. I am wondering where I should install the cryoSPARC worker (the same login node, or one of the cluster nodes with a GPU), and where to set up the cluster_info.json and cluster_script.sh files.

If I install the cryoSPARC worker on one of the GPU nodes, a hostname issue arises when Slurm assigns a different GPU node than the one the worker was installed on.

How do I resolve hostname issues when jobs may land on any of several GPU compute nodes?

Kindly advise.

Best,
Rajiv

Hi @Rajiv-Singh,

Take a look at our Hardware and Software Requirements guide: https://guide.cryosparc.com/setup-configuration-and-management/hardware-and-system-requirements
One of the requirements for a cluster system is a storage volume shared across the worker nodes and the master node. This means that when you install the worker, it will be available to all nodes in the cluster at the same location. You have to install the worker on one of the GPU nodes (mainly because the installer requires the CUDA toolkit to compile one of its dependencies). Once that’s done, set up the cluster integration, explained here: https://guide.cryosparc.com/setup-configuration-and-management/how-to-download-install-and-configure/downloading-and-installing-cryosparc#connect-a-cluster-to-cryosparc
You don’t have to worry about managing hostnames, as the cluster will do that for you. CryoSPARC will only submit jobs to the cluster, and the cluster will take care of the rest.
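
As a rough sketch of the layout (all paths below are placeholders for your shared volume):

  # everything lives on the shared volume, visible from every node
  /shared/cryosparc/cryosparc_master    # runs on the login/master node
  /shared/cryosparc/cryosparc_worker    # installed once from a GPU node (the installer needs the CUDA toolkit)
  /shared/cryosparc/cluster_config      # holds cluster_info.json and cluster_script.sh

  # after the worker install, register the cluster integration from the master node
  cd /shared/cryosparc/cluster_config
  cryosparcm cluster connect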

Hi @stephan,

I have followed the same guidelines and installed the master and worker on a shared storage volume: the master on one of our login nodes and the worker on a GPU node. I then connected the cluster using cluster_info.json and cluster_script.sh with our cluster’s parameters, which resulted in an added “Lane slurm cryosparc (cluster).”

However, do I need to write a script to run on the cluster as an sbatch script? And how can I use the cryoSPARC instance interactively through a web browser? Thanks in advance for the responses and advice.

Hi @Rajiv-Singh,

Great, if you’ve connected the cluster using the cluster_info.json and cluster_script.sh files, you’re good to go. You can now queue a job, and cryoSPARC will automatically submit it to the cluster to be scheduled onto a GPU node.

In terms of accessing the user interface, take a look at this section of our guide that explains how to do it: https://guide.cryosparc.com/setup-configuration-and-management/how-to-download-install-and-configure/accessing-cryosparc
The web interface is hosted on a port on the master node. You have to either use a reverse proxy or create an SSH tunnel to your local machine in order to access it.
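
For example, assuming the default base port of 39000 and a master host called login01 (both placeholders for your setup), an SSH tunnel from your local machine would look like:

  ssh -N -L 39000:localhost:39000 your_username@login01

and you would then open http://localhost:39000 in your browser.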

Hey @stephan,

I have used the standard cluster_info.json and cluster_script.sh to connect to the cluster. However, do I need to write a separate script for memory, number of cores, number of GPUs, etc.? When I try to queue a job, it starts by requesting all these parameters but returns the following error: Failed to launch! 255.

Please advise to troubleshoot this.

Hi @Rajiv-Singh,

No, this is what the cluster_script.sh is for. When you queue a job in cryoSPARC, the cryoSPARC scheduler will use cluster_script.sh as a template and fill in the variables found in the script (e.g. num_cpu, num_gpu, etc.) based on the job that’s being queued. Each job in cryoSPARC has its own unique set of resource requirements, which may change with the number of GPUs requested.
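
For reference, the Slurm section of a cluster_script.sh template typically looks something like the sketch below (the partition name is a placeholder, and your own template may use different directives; the {{ ... }} variables are filled in by cryoSPARC at queue time):

  #!/usr/bin/env bash
  #SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
  #SBATCH --partition=your_gpu_slurm_partition
  #SBATCH --ntasks={{ num_cpu }}
  #SBATCH --gres=gpu:{{ num_gpu }}
  #SBATCH --mem={{ (ram_gb*1000)|int }}M
  #SBATCH --output={{ job_log_path_abs }}

  {{ run_cmd }}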

A complete explanation is found here: https://guide.cryosparc.com/setup-configuration-and-management/how-to-download-install-and-configure/downloading-and-installing-cryosparc#create-the-files

This is most likely an SSH error. Can you post the output of cryosparcm log command_core when this happens?

Hey @stephan,

Here is the output of cryosparcm log command_core (I have edited the install directory path and SSD path):

  File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 124, in wrapper
    res = func(*args, **kwargs)
  File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 1838, in scheduler_run
    scheduler_run_core(do_run)
  File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 2056, in scheduler_run_core
    run_job(job['project_uid'], job['uid']) # takes care of the cluster case and the node case
  File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 133, in wrapper
    raise e
  File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 124, in wrapper
    res = func(*args, **kwargs)
  File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 2288, in run_job
    res = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True).decode()
  File "/INSTALL/DIRECTORY/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/INSTALL/DIRECTORY/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'ssh loginnode sbatch /SSD/PATH/10248project/P2/J26/queue_sub_script.sh' returned non-zero exit status 255.
-----------------------------------------------------
[JSONRPC ERROR  2021-08-28 02:48:51.315905  at  enqueue_job ]
-----------------------------------------------------
Traceback (most recent call last):
  File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 124, in wrapper
    res = func(*args, **kwargs)
  File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 4958, in enqueue_job
    scheduler_run()
  File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 133, in wrapper
    raise e
  File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 124, in wrapper
    res = func(*args, **kwargs)
  File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 1838, in scheduler_run
    scheduler_run_core(do_run)
  File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 2056, in scheduler_run_core
    run_job(job['project_uid'], job['uid']) # takes care of the cluster case and the node case
  File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 133, in wrapper
    raise e
  File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 124, in wrapper
    res = func(*args, **kwargs)
  File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 2288, in run_job
    res = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True).decode()
  File "/INSTALL/DIRECTORY/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/INSTALL/DIRECTORY/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'ssh loginnode sbatch /SSD/PATH/10248project/P2/J26/queue_sub_script.sh' returned non-zero exit status 255.
-----------------------------------------------------
Waiting for data... (interrupt to abort)

Hi @Rajiv-Singh,

Thanks for sending that. In your cluster_info.json, can you remove the send_cmd_tpl value?

Shall I remove the entire line

  "send_cmd_tpl" : "ssh loginnode {{ command }}",

or keep this:

  "send_cmd_tpl" :

You can remove the entire line; just make sure the JSON document is still valid (no missing commas).
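
For example (all paths below are placeholders; keep the values from your existing file), the resulting cluster_info.json would simply have no send_cmd_tpl entry:

  {
      "name": "slurm cryosparc",
      "worker_bin_path": "/path/to/cryosparc_worker/bin/cryosparcw",
      "cache_path": "/path/to/ssd/cache",
      "qsub_cmd_tpl": "sbatch {{ script_path_abs }}",
      "qstat_cmd_tpl": "squeue -j {{ cluster_job_id }}",
      "qdel_cmd_tpl": "scancel {{ cluster_job_id }}",
      "qinfo_cmd_tpl": "sinfo"
  }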

Sure.

Do I need to reconnect or update the cluster after editing cluster_info.json?

Yes, you need to run cryosparcm cluster connect in the same directory where the cluster_info.json and cluster_script.sh files are in order to update the cluster lane.

Hey @stephan

I have updated cluster_info.json and reconnected using cryosparcm cluster connect from the master node, which resulted in “Successfully added cluster lane slurm cryosparc.”

However, when I submit a job from the cryoSPARC instance to the cluster lane, it fails again with the following error from cryosparcm log command_core:

  File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 124, in wrapper
    res = func(*args, **kwargs)
  File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 1838, in scheduler_run
    scheduler_run_core(do_run)
  File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 2056, in scheduler_run_core
    run_job(job['project_uid'], job['uid']) # takes care of the cluster case and the node case
  File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 133, in wrapper
    raise e
  File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 124, in wrapper
    res = func(*args, **kwargs)
  File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 2288, in run_job
    res = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True).decode()
  File "/INSTALL/DIRECTORY//cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/INSTALL/DIRECTORY//cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'sbatch /SSD/PATH/10248project/P2/J26/queue_sub_script.sh' returned non-zero exit status 1.
-----------------------------------------------------
[JSONRPC ERROR  2021-08-28 04:11:12.486976  at  enqueue_job ]
-----------------------------------------------------
Traceback (most recent call last):
  File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 124, in wrapper
    res = func(*args, **kwargs)
  File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 4958, in enqueue_job
    scheduler_run()
  File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 133, in wrapper
    raise e
  File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 124, in wrapper
    res = func(*args, **kwargs)
  File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 1838, in scheduler_run
    scheduler_run_core(do_run)
  File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 2056, in scheduler_run_core
    run_job(job['project_uid'], job['uid']) # takes care of the cluster case and the node case
  File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 133, in wrapper
    raise e
  File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 124, in wrapper
    res = func(*args, **kwargs)
  File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 2288, in run_job
    res = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True).decode()
  File "/INSTALL/DIRECTORY//cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/INSTALL/DIRECTORY//cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'sbatch /SSD/PATH/10248project/P2/J26/queue_sub_script.sh' returned non-zero exit status 1.
-----------------------------------------------------
Waiting for data... (interrupt to abort)

This time, you can run this command directly in the shell to find out what the error is.
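
For example, from a shell on the node where the cryoSPARC master runs (using the script path from the log above):

  sbatch /SSD/PATH/10248project/P2/J26/queue_sub_script.sh

sbatch will print the underlying Slurm error directly (for example, an invalid partition, account, or resource request), which should point to what needs changing in cluster_script.sh.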