Job stays in Launching state forever

Hi!
I would really appreciate it if you could help me finalize my installation of cryoSPARC on a new worker.

Context: I have a 4-GPU machine (CentOS) that is both master and worker. I bought a new 4-GPU machine (Ubuntu) and want to connect it to the first machine as a worker. After I ran through the installation and saw the new lane in the app, I launched a job. The job stays in the launching state forever.

What I tried: I already saw the post Job Halts in Launched State and followed all the proposed solutions, but I still have the same issue.

Do you think the problem is that one machine runs CentOS and the other Ubuntu? Could you please help me solve this issue?
Thanks a lot,
Samara

Hi @Smona,

When jobs stay in “launched” status, it usually means something went wrong when the master node tried to launch the job on the worker node. Having different operating systems is not a problem.
Can you clear the job, relaunch it, and when you notice it’s “stuck”, send over the output of cryosparcm log command_core?

Hi @stephan,
Thanks a lot for your reply. Here is the output:

---------- Scheduler running ---------------
Jobs Queued: [('P82', 'J31')]
Licenses currently active : 4
Now trying to schedule J31
Need slots : {'CPU': 2, 'GPU': 1, 'RAM': 3}
Need fixed : {'SSD': True}
Master direct : False
Scheduling job to quartet.mshri.on.ca
Not a commercial instance - heartbeat set to 12 hours.
Launchable! – Launching.
Changed job P82.J31 status launched
Running project UID P82 job UID J31
Running job on worker type node
Running job using: /home/cryspc/Cryosparc_worker2/cryosparc_worker/bin/cryosparcw
Running job on remote worker node hostname quartet.mshri.on.ca
cmd: bash -c "nohup /home/cryspc/Cryosparc_worker2/cryosparc_worker/bin/cryosparcw run --project P82 --job J31 --master_hostname quad.mshri.on.ca --master_command_core_port 39002 > /Cry/CryoSparcV2/P82/J31/job.log 2>&1 & "

Hi @Smona,

Can you kill and clear the job again, then try running the following command:
/home/cryspc/Cryosparc_worker2/cryosparc_worker/bin/cryosparcw run --project P82 --job J31 --master_hostname quad.mshri.on.ca --master_command_core_port 39002
and paste the output here?

Hi @stephan!

When I ran the command from the worker machine, this is the output:

/home/cryspc/Cryosparc_worker2/cryosparc_worker/bin/cryosparcw run --project P82 --job J32 --master_hostname quad.mshri.on.ca --master_command_core_port 39002


================= CRYOSPARCW =======  2021-10-06 16:56:27.614953  =========
Project P82 Job J32
Master quad.mshri.on.ca Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 67360
MAIN PID 67360
class2D.run cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process
**** handle exception rc
set status to failed
Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 71, in cryosparc_compute.run.main
  File "/home/cryspc/Cryosparc_worker2/cryosparc_worker/cryosparc_compute/jobs/jobregister.py", line 362, in get_run_function
    runmod = importlib.import_module(".."+modname, __name__)
  File "/home/cryspc/Cryosparc_worker2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 1050, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "cryosparc_worker/cryosparc_compute/jobs/class2D/run.py", line 13, in init cryosparc_compute.jobs.class2D.run
  File "/home/cryspc/Cryosparc_worker2/cryosparc_worker/cryosparc_compute/engine/__init__.py", line 8, in <module>
    from .engine import *  # noqa
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 9, in init cryosparc_compute.engine.engine
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 12, in init cryosparc_compute.engine.cuda_core
  File "/home/cryspc/Cryosparc_worker2/cryosparc_worker/cryosparc_compute/skcuda_internal/fft.py", line 27, in <module>
    from . import misc2 as misc
  File "/home/cryspc/Cryosparc_worker2/cryosparc_worker/cryosparc_compute/skcuda_internal/misc2.py", line 32, in <module>
    from . import cuda
  File "/home/cryspc/Cryosparc_worker2/cryosparc_worker/cryosparc_compute/skcuda_internal/cuda.py", line 17, in <module>
    from .cudadrv import *
  File "/home/cryspc/Cryosparc_worker2/cryosparc_worker/cryosparc_compute/skcuda_internal/cudadrv.py", line 39, in <module>
    raise OSError('CUDA driver library not found')
OSError: CUDA driver library not found
========= main process now complete.
========= monitor process now complete.
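
(Side note: whether the NVIDIA driver library is visible to the worker can be checked with something like the following; this is only a rough sketch, and the exact library location depends on how the driver was installed.)

# On the worker (quartet): confirm the NVIDIA driver and GPUs are visible
nvidia-smi
# Confirm the driver library libcuda.so is known to the dynamic linker
ldconfig -p | grep libcuda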

When I ran it from the master, this is the output:

/home/cryspc/Cryosparc_worker2/cryosparc_worker/bin/cryosparcw run --project P82 --job J32 --master_hostname quad.mshri.on.ca --master_command_core_port 39002
bash: /home/cryspc/Cryosparc_worker2/cryosparc_worker/bin/cryosparcw: No such file or directory
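
(The binary itself does exist on the worker, as the run above shows; from the master it can also be checked over SSH, which additionally exercises the password-less SSH access the master needs. The user and hostname below are taken from the paths in the scheduler log; adjust if yours differ.)

# Run on the master (quad); this should print the path without prompting for a password
ssh cryspc@quartet.mshri.on.ca ls /home/cryspc/Cryosparc_worker2/cryosparc_worker/bin/cryosparcw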

Hi @stephan, did you have a chance to look at this? We would really appreciate it if you could help us troubleshoot.
Thanks,
Samara

Hi @Smona,

Based on the launch command in the scheduler output, it doesn’t seem like this worker was connected correctly to the master instance. What arguments did you use when running the cryosparcw connect command?

Specifically, the lack of an SSH command around that launch line indicates the master node is trying to run the job on the same server, even though this worker is a remote machine.
See our guide for more information: https://guide.cryosparc.com/setup-configuration-and-management/how-to-download-install-and-configure/downloading-and-installing-cryosparc#connecting-a-worker-node
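
To illustrate, when a worker is registered as a remote node, the launch line in command_core would normally be wrapped in the worker’s SSH string, roughly like this (a sketch only; the exact form can differ between versions):

ssh cryspc@quartet.mshri.on.ca "nohup /home/cryspc/Cryosparc_worker2/cryosparc_worker/bin/cryosparcw run --project P82 --job J31 --master_hostname quad.mshri.on.ca --master_command_core_port 39002 > /Cry/CryoSparcV2/P82/J31/job.log 2>&1 &"

In your log the same command runs under a plain bash -c on the master itself, which would explain why the job never actually starts on quartet.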

Hi @stephan,
Thanks for your reply. I ran this command: ./bin/cryosparcw connect --worker quartet.mshri.on.ca --master quad.mshri.on.ca --port 39000 --ssdpath /ssd_cache --lane Quartet --sshstr cryspc@quad.mshri.on.ca

I also tried this one, with the worker sshstr, because I wasn’t sure which one I should use, but neither of them worked:
./bin/cryosparcw connect --worker quartet.mshri.on.ca --master quad.mshri.on.ca --port 39000 --ssdpath /ssd_cache --lane Quartet --sshstr cryspc@quartet.mshri.on.ca --update

Here is what I have currently:

 Updating target quartet.mshri.on.ca
  Current configuration:
               cache_path :  /ssd_cache
           cache_quota_mb :  None
         cache_reserve_mb :  10000
                     desc :  None
                     gpus :  [{'id': 0, 'mem': 12788105216, 'name': 'NVIDIA TITAN Xp'}, {'id': 1, 'mem': 12788498432, 'name': 'NVIDIA TITAN Xp'}, {'id': 2, 'mem': 12788498432, 'name': 'NVIDIA TITAN Xp'}, {'id': 3, 'mem': 12788498432, 'name': 'NVIDIA TITAN Xp'}]
                 hostname :  quartet.mshri.on.ca
                     lane :  Quartet
             monitor_port :  None
                     name :  quartet.mshri.on.ca
           resource_fixed :  {'SSD': True}
           resource_slots :  {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}
                  ssh_str :  cryspc@quad.mshri.on.ca
                    title :  Worker node quartet.mshri.on.ca
                     type :  node
          worker_bin_path :  /home/cryspc/Cryosparc_worker2/cryosparc_worker/bin/cryosparcw
 ---------------------------------------------------------------
  SSH connection string will be updated to cryspc@quartet.mshri.on.ca
  SSD will be enabled.
  Worker will be registered with SSD cache location /ssd_cache
  SSD path will be updated to /ssd_cache
  Worker will be reassigned to lane Quartet
 ---------------------------------------------------------------
  Updating..
  Done.
 ---------------------------------------------------------------
  Final configuration for quartet.mshri.on.ca
               cache_path :  /ssd_cache
           cache_quota_mb :  None
         cache_reserve_mb :  10000
                     desc :  None
                     gpus :  [{'id': 0, 'mem': 12788105216, 'name': 'NVIDIA TITAN Xp'}, {'id': 1, 'mem': 12788498432, 'name': 'NVIDIA TITAN Xp'}, {'id': 2, 'mem': 12788498432, 'name': 'NVIDIA TITAN Xp'}, {'id': 3, 'mem': 12788498432, 'name': 'NVIDIA TITAN Xp'}]
                 hostname :  quartet.mshri.on.ca
                     lane :  Quartet
             monitor_port :  None
                     name :  quartet.mshri.on.ca
           resource_fixed :  {'SSD': True}
           resource_slots :  {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}
                  ssh_str :  cryspc@quartet.mshri.on.ca
                    title :  Worker node quartet.mshri.on.ca
                     type :  node
          worker_bin_path :  /home/cryspc/Cryosparc_worker2/cryosparc_worker/bin/cryosparcw
 ---------------------------------------------------------------

Thanks, can you take a screenshot of the “Instance Information” tab in the Resource Manager?

@stephan here is the info:

Hi @stephan,
Hope you are doing fine. Here is an update:
I disconnected the lane and deleted the cryoSPARC installation on the new worker machine (quartet).
Then I installed the standalone package on quartet to check whether there was any issue with the machine itself. The installation succeeded, and I was able to create a project and run some jobs. So I guess the problem is the connection between the two machines.
Still, I don’t think this is the best setup for our lab, so I would like to check with you whether there is a way to connect the two standalone computers, or whether you have a hint about why one cannot connect to the other.

P.S. If you are fine with it, we could arrange a quick call to address this issue. Our whole lab would really appreciate it.

Thanks for your advice.
Samara

Hey @Smona,

No problem, let’s start from the beginning.
If you’re setting up cryoSPARC in the master-worker architecture, you need to ensure three things:
(More information is available in our guide; a quick way to verify each of these points is sketched after the list.)

  1. All nodes have access to a shared file system. This file system is where the project directories are located, allowing all nodes to read and write intermediate results as jobs start and complete.

  2. The master node has password-less SSH access to each of the worker nodes. SSH is used to execute jobs on the worker nodes from the master node.

  3. All worker nodes have TCP access to 10 consecutive ports on the master node (default ports are 39000-39010). These ports are used for metadata communication via HTTP Remote Procedure Call (RPC) based API requests.
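
A quick way to verify each of these three points from the command line looks roughly like this (a sketch only; the hostnames and paths follow the earlier posts, and the tools used are common defaults):

# 1. Shared file system: a file created in the project directory on the master
#    should be visible from the worker
touch /Cry/CryoSparcV2/test_shared_fs      # on quad (master)
ls -l /Cry/CryoSparcV2/test_shared_fs      # on quartet (worker)

# 2. Password-less SSH from the master to the worker
ssh-keygen -t rsa                          # on quad, only if no key exists yet
ssh-copy-id cryspc@quartet.mshri.on.ca     # on quad
ssh cryspc@quartet.mshri.on.ca hostname    # should print the worker hostname without a password prompt

# 3. TCP access from the worker back to the master's ports (default 39000-39010)
nc -zv quad.mshri.on.ca 39000              # on quartet; repeat for the other ports as needed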

Once those are confirmed, install cryoSPARC (master and worker) in a directory that is available on both instances (e.g., /u/cryosparcuser/cryosparc/cryosparc_master and /u/cryosparcuser/cryosparc/cryosparc_worker).
Now, you can connect the worker back to the main instance (by using the same ./cryosparcw connect command).
(Note that if you want to start the connection from scratch, you can remove a lane by running the command: cryosparcm cli "remove_scheduler_lane('Quartet')")
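
For example, with the prerequisites above in place, reconnecting quartet would look very much like the command you already ran, executed from inside the worker installation on quartet (note that --sshstr is the string the master uses to reach the worker):

./bin/cryosparcw connect --worker quartet.mshri.on.ca --master quad.mshri.on.ca --port 39000 --ssdpath /ssd_cache --lane Quartet --sshstr cryspc@quartet.mshri.on.ca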

Hi @stephan
Hope you are doing fine. Thanks a lot for the detailed explanation. I want to ask you a couple of questions about some issues I encountered.
At the moment, we have the master installation in a folder that is not accessible from the new worker. I think that is why the worker could not be connected properly. Therefore, we want to move the master installation to a shared location that the new worker can access.
What is the best way to move the installation to another location without messing up any of the existing projects? Could I just move the installation folders to the new location and then keep working on the projects without any issues?
Thanks a lot for your help.
Best,
Samara