I have cryoSPARC installed and set up to run on the GPU node of our cluster. But there are other nodes available, and I want to be able to select which node a job is queued to. Since not all jobs require a GPU, it would be more efficient to send each job to an appropriate partition. How do I set this up?
Hi @73km,
I’d like to understand your issue in a bit more detail. Have you currently set up cryoSPARC on a single cluster and you’re looking to split the cluster into specific nodes, or do you already have your nodes set up as separate workers?
Addy.
So in our cluster setup there are multiple partitions/nodes: over a hundred CPU nodes and only 2 GPU nodes. While setting up cryoSPARC, I set partition=gpu [for the GPU nodes] because that’s where the CUDA drivers were and cryoSPARC can make use of the GPUs. But I realize not all jobs require GPUs, so now I want to be able to select which partition to send a job to based on its requirements. In the tutorial videos, I see an option to select different lanes while queuing jobs. It would be nice if I could make each of those lanes correspond to one of the partitions we have.
Hi @73km,
Do you use slurm on your cluster?
If you already have a lane definition for the GPU partition, just copy it and modify:
- the cluster_info.json file, so that "name" is different (that will be the name of your new lane);
- the default cluster_script.sh, so that the parameter #SBATCH --partition=your_gpu_slurm_partition points to your new partition (your_cpu_partition), as in the sketch below.
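For example, a minimal sketch of that copy-and-edit step (the directory names, lane name, and partition names here are placeholders, not cryoSPARC defaults):

```bash
# Start from the existing GPU lane's files and keep the CPU lane in its own folder.
mkdir cpu_lane
cp gpu_lane/cluster_info.json gpu_lane/cluster_script.sh cpu_lane/
cd cpu_lane

# 1. In cluster_info.json, give the lane a new name, e.g.
#        "name": "slurm-cpu",
# 2. In cluster_script.sh, point jobs at the CPU partition, e.g.
#        #SBATCH --partition=your_cpu_partition
```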
Then run cryosparcm cluster connect and your new lane should be visible in the web interface.
You will still need to choose the lane when you launch the job…
Best,
Juan
Thanks! It worked! I wish I could keep multiple of these cluster scripts so that I can maintain them side by side. It seems I can only have one set of files at a time.
Hi @73km,
If I’m understanding your post correctly, you’d like to have a copy of each set of cluster integration scripts.
You can do the following:
Create a folder called cryosparc_cluster_scripts adjacent to cryosparc_master and cryosparc_worker. Inside this folder, create a new folder for each partition (you can run cryosparcm cluster dump <cluster name> inside each folder to get cryoSPARC to write out the two files for that cluster integration):
- cryosparc_cluster_scripts
  - partition1
    - cluster_info.json
    - cluster_script.sh
  - partition2
    - cluster_info.json
    - cluster_script.sh
  - partition3
    - cluster_info.json
    - cluster_script.sh
When you run cryosparcm cluster connect, it reads the two required files from your current working directory. Whenever you want to modify one of the configurations, navigate to the folder of the partition you want to change, edit the files there, then run the command again.
This way you don’t have to keep modifying a single set of files in order to update each configuration.
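A rough sketch of that workflow, assuming an existing lane named slurm-gpu and placeholder paths:

```bash
# One folder of integration files per partition, next to cryosparc_master.
mkdir -p /path/to/cryosparc_cluster_scripts/{gpu_partition,cpu_partition}

# Let cryoSPARC write out the current files for an existing lane.
cd /path/to/cryosparc_cluster_scripts/gpu_partition
cryosparcm cluster dump slurm-gpu    # writes cluster_info.json and cluster_script.sh here

# Later, to change a lane: edit its files in place, then re-register it.
cd /path/to/cryosparc_cluster_scripts/gpu_partition
# ... edit cluster_info.json and/or cluster_script.sh ...
cryosparcm cluster connect           # reads the two files from the current directory
```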
Hey Stephan,
I have installed the cryoSPARC master on one of our login nodes. I am wondering where I should install the cryoSPARC worker (the same login node, or one of the cluster nodes with GPUs), and where to then set up cluster_info.json and cluster_script.sh.
If I install the cryoSPARC worker on one of the GPU nodes, a hostname issue arises whenever Slurm assigns a different GPU node than the one the worker was installed on.
How do I resolve these hostname issues when jobs may land on various GPU compute nodes?
Kindly advise.
Best,
Rajiv
Hi @Rajiv-Singh,
Take a look at our Hardware and Software Requirements guide: https://guide.cryosparc.com/setup-configuration-and-management/hardware-and-system-requirements
One of the requirements for a cluster system is to have a shared storage volume across the worker nodes and master nodes. This means that when you install the worker, it will be available to all the nodes in the cluster at the same location. You have to install the worker on one of the GPU nodes (mainly because the installer requires the CUDA toolkit to compile one of its dependencies). Once that’s done, you set up the cluster integration, explained here: https://guide.cryosparc.com/setup-configuration-and-management/how-to-download-install-and-configure/downloading-and-installing-cryosparc#connect-a-cluster-to-cryosparc
You don’t have to worry about managing hostnames, as the cluster will do that for you. CryoSPARC will only submit jobs to the cluster, and the cluster will take care of the rest.
Hi @stephan,
I have followed those guidelines and installed the master and worker on the shared storage volume: the master on one of our login nodes and the worker on a GPU node. I then connected the cluster using cluster_info.json and cluster_script.sh with our cluster’s parameters, which added a lane "slurm cryosparc (cluster)".
However, do I need to write a separate sbatch script to run on the cluster? And how can I use the cryoSPARC instance interactively through a web browser? Thanks in advance for the responses and advice.
Hi @Rajiv-Singh,
Great, if you’ve connected the cluster using the cluster_info.json and cluster_script.sh files, you’re good to go. You can now start running a job, and cryoSPARC will automatically submit it to the cluster to be scheduled onto a GPU node.
In terms of accessing the user interface, take a look at this section of our guide that explains how to do it: https://guide.cryosparc.com/setup-configuration-and-management/how-to-download-install-and-configure/accessing-cryosparc
The web interface is hosted on a port on the master node. You have to either use a reverse proxy or create an SSH tunnel to your local machine in order to access it.
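For example, assuming the default base port of 39000 and a placeholder hostname, an SSH tunnel from your local machine could look like this:

```bash
# Forward the master node's web port to your local machine, then browse to
# http://localhost:39000 (39000 is the default cryoSPARC base port).
ssh -N -L 39000:localhost:39000 your_username@login-node.your.cluster
```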
Hey @stephan,
I have used the standard cluster_info.json and cluster_script.sh to connect to the cluster. However, do I need to specify memory, number of cores, number of GPUs, etc. in a separate script? When I queue a job, it starts by requesting all these parameters but then returns the following error: Failed to launch! 255.
Please advise on how to troubleshoot this.
Hi @Rajiv-Singh,
No, this is what the cluster_script.sh is for. When you queue a job in cryoSPARC, the cryoSPARC scheduler will use the cluster_script.sh as a template and fill in the variables found in the script (e.g. num_cpu, num_gpu, etc.) based on the job that’s being queued. Each job in cryoSPARC has its own unique set of resource requirements that may change with the number of GPUs requested.
A complete explanation is found here: https://guide.cryosparc.com/setup-configuration-and-management/how-to-download-install-and-configure/downloading-and-installing-cryosparc#create-the-files
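As a rough illustration (loosely based on the default template; the partition name and exact SBATCH options are placeholders to adapt), the resource-related part of cluster_script.sh looks something like this, with the double-braced variables filled in per job by the cryoSPARC scheduler:

```bash
#!/usr/bin/env bash
# Per-job values (CPUs, GPUs, RAM, paths) are substituted into the {{ ... }}
# variables by cryoSPARC before the script is submitted with sbatch.
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --partition=your_gpu_partition
#SBATCH --ntasks={{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --mem={{ ram_gb }}G
#SBATCH --output={{ job_log_path_abs }}

# The actual cryosparcw command for this particular job:
{{ run_cmd }}
```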
This is most likely an SSH error. Can you post the output of cryosparcm log command_core when this happens?
Hey @stephan,
Here is the output of cryosparcm log command_core (I have edited the install directory path and the SSD path):
File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 124, in wrapper
res = func(*args, **kwargs)
File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 1838, in scheduler_run
scheduler_run_core(do_run)
File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 2056, in scheduler_run_core
run_job(job['project_uid'], job['uid']) # takes care of the cluster case and the node case
File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 133, in wrapper
raise e
File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 124, in wrapper
res = func(*args, **kwargs)
File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 2288, in run_job
res = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True).decode()
File "/INSTALL/DIRECTORY/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 411, in check_output
**kwargs).stdout
File "/INSTALL/DIRECTORY/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'ssh loginnode sbatch /SSD/PATH/10248project/P2/J26/queue_sub_script.sh' returned non-zero exit status 255.
-----------------------------------------------------
[JSONRPC ERROR 2021-08-28 02:48:51.315905 at enqueue_job ]
-----------------------------------------------------
Traceback (most recent call last):
File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 124, in wrapper
res = func(*args, **kwargs)
File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 4958, in enqueue_job
scheduler_run()
File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 133, in wrapper
raise e
File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 124, in wrapper
res = func(*args, **kwargs)
File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 1838, in scheduler_run
scheduler_run_core(do_run)
File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 2056, in scheduler_run_core
run_job(job['project_uid'], job['uid']) # takes care of the cluster case and the node case
File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 133, in wrapper
raise e
File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 124, in wrapper
res = func(*args, **kwargs)
File "/INSTALL/DIRECTORY/cryosparc_master/cryosparc_command/command_core/__init__.py", line 2288, in run_job
res = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True).decode()
File "/INSTALL/DIRECTORY/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 411, in check_output
**kwargs).stdout
File "/INSTALL/DIRECTORY/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'ssh loginnode sbatch /SSD/PATH/10248project/P2/J26/queue_sub_script.sh' returned non-zero exit status 255.
-----------------------------------------------------
Waiting for data... (interrupt to abort)
Hi @Rajiv-Singh,
Thanks for sending that. In your cluster_info.json, can you remove the send_cmd_tpl value?
Shall I remove the entire line
"send_cmd_tpl" : "ssh loginnode {{ command }}",
or keep this part:
"send_cmd_tpl" :
You can remove the entire line, just make sure the JSON document is still valid (no missing commas).
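For instance, a hypothetical cluster_info.json after removing that line (the paths and names are placeholders, and the command templates shown are the typical defaults from the cluster setup example; the remaining lines must still be valid JSON, so watch the trailing commas):

```json
{
    "name": "slurm cryosparc",
    "worker_bin_path": "/INSTALL/DIRECTORY/cryosparc_worker/bin/cryosparcw",
    "cache_path": "/SSD/PATH/cache",
    "qsub_cmd_tpl": "sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl": "squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl": "scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl": "sinfo"
}
```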
Sure.
Do I need to reconnect the cluster or update anything after editing cluster_info.json?
Yes, you need to run cryosparcm cluster connect in the same directory where the cluster_info.json and cluster_script.sh files are, in order to update the cluster lane.
Hey @stephan
I have updated cluster_info.json and reconnected using cryosparcm cluster connect from the master node, which resulted in "Successfully added cluster lane slurm cryosparc."
However, when I submit a job from the cryoSPARC instance on the cluster lane, it fails again; cryosparcm log command_core shows the following error:
File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 124, in wrapper
res = func(*args, **kwargs)
File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 1838, in scheduler_run
scheduler_run_core(do_run)
File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 2056, in scheduler_run_core
run_job(job['project_uid'], job['uid']) # takes care of the cluster case and the node case
File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 133, in wrapper
raise e
File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 124, in wrapper
res = func(*args, **kwargs)
File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 2288, in run_job
res = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True).decode()
File "/INSTALL/DIRECTORY//cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 411, in check_output
**kwargs).stdout
File "/INSTALL/DIRECTORY//cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'sbatch /SSD/PATH/10248project/P2/J26/queue_sub_script.sh' returned non-zero exit status 1.
-----------------------------------------------------
[JSONRPC ERROR 2021-08-28 04:11:12.486976 at enqueue_job ]
-----------------------------------------------------
Traceback (most recent call last):
File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 124, in wrapper
res = func(*args, **kwargs)
File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 4958, in enqueue_job
scheduler_run()
File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 133, in wrapper
raise e
File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 124, in wrapper
res = func(*args, **kwargs)
File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 1838, in scheduler_run
scheduler_run_core(do_run)
File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 2056, in scheduler_run_core
run_job(job['project_uid'], job['uid']) # takes care of the cluster case and the node case
File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 133, in wrapper
raise e
File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 124, in wrapper
res = func(*args, **kwargs)
File "/INSTALL/DIRECTORY//cryosparc_master/cryosparc_command/command_core/__init__.py", line 2288, in run_job
res = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True).decode()
File "/INSTALL/DIRECTORY//cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 411, in check_output
**kwargs).stdout
File "/INSTALL/DIRECTORY//cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'sbatch /SSD/PATH/10248project/P2/J26/queue_sub_script.sh' returned non-zero exit status 1.
-----------------------------------------------------
Waiting for data... (interrupt to abort)
This time sbatch itself is failing (exit status 1) rather than the SSH step, so you can run that command directly in the shell to find out what the actual error is.
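For example, on the node where the cryoSPARC master runs (using the script path from the log above):

```bash
# Running the generated submission script by hand shows sbatch's actual error
# message (e.g. an invalid partition, account, QOS, or resource request).
sbatch /SSD/PATH/10248project/P2/J26/queue_sub_script.sh
```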