Job stays in Launching state forever

Hi!
I would really appreciate it if you could help me finalize my installation of cryoSPARC on a new worker.

Context: I have a 4-GPU machine (CentOS) that is both master and worker. I bought a new 4-GPU machine (Ubuntu) and want to connect it to the first machine as a worker. After I ran through the installation and saw the new lane in the app, I launched a job. The job stays in the launching state forever.

What I tried: I already saw the post Job Halts in Launched State and followed all the proposed solutions, but I still have the same issue.

Do you think the problem is that one machine runs CentOS and the other Ubuntu? Could you please help me solve this issue?
Thanks a lot,
Samara

Hi @Smona,

When jobs stay in “launched” status, it usually means something went wrong when the master node tried to launch the job on the worker node. Having different operating systems is not a problem.
Can you clear the job, relaunch it, and when you notice it’s “stuck”, send over the output of cryosparcm log command_core?

Hi @stephan,
Thanks a lot for your reply. Here is the output:

---------- Scheduler running ---------------
Jobs Queued: [('P82', 'J31')]
Licenses currently active : 4
Now trying to schedule J31
Need slots : {'CPU': 2, 'GPU': 1, 'RAM': 3}
Need fixed : {'SSD': True}
Master direct : False
Scheduling job to quartet.mshri.on.ca
Not a commercial instance - heartbeat set to 12 hours.
Launchable! – Launching.
Changed job P82.J31 status launched
Running project UID P82 job UID J31
Running job on worker type node
Running job using: /home/cryspc/Cryosparc_worker2/cryosparc_worker/bin/cryosparcw
Running job on remote worker node hostname quartet.mshri.on.ca
cmd: bash -c "nohup /home/cryspc/Cryosparc_worker2/cryosparc_worker/bin/cryosparcw run --project P82 --job J31 --master_hostname quad.mshri.on.ca --master_command_core_port 39002 > /Cry/CryoSparcV2/P82/J31/job.log 2>&1 & "

Hi @Smona,

Can you kill and clear the job again, then try running the following command:
/home/cryspc/Cryosparc_worker2/cryosparc_worker/bin/cryosparcw run --project P82 --job J31 --master_hostname quad.mshri.on.ca --master_command_core_port 39002
and paste the output here?

Hi @stephan!

When I ran the command from the worker machine, this is the output:

/home/cryspc/Cryosparc_worker2/cryosparc_worker/bin/cryosparcw run --project P82 --job J32 --master_hostname quad.mshri.on.ca --master_command_core_port 39002


================= CRYOSPARCW =======  2021-10-06 16:56:27.614953  =========
Project P82 Job J32
Master quad.mshri.on.ca Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 67360
MAIN PID 67360
class2D.run cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process
**** handle exception rc
set status to failed
Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 71, in cryosparc_compute.run.main
  File "/home/cryspc/Cryosparc_worker2/cryosparc_worker/cryosparc_compute/jobs/jobregister.py", line 362, in get_run_function
    runmod = importlib.import_module(".."+modname, __name__)
  File "/home/cryspc/Cryosparc_worker2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 1050, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "cryosparc_worker/cryosparc_compute/jobs/class2D/run.py", line 13, in init cryosparc_compute.jobs.class2D.run
  File "/home/cryspc/Cryosparc_worker2/cryosparc_worker/cryosparc_compute/engine/__init__.py", line 8, in <module>
    from .engine import *  # noqa
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 9, in init cryosparc_compute.engine.engine
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 12, in init cryosparc_compute.engine.cuda_core
  File "/home/cryspc/Cryosparc_worker2/cryosparc_worker/cryosparc_compute/skcuda_internal/fft.py", line 27, in <module>
    from . import misc2 as misc
  File "/home/cryspc/Cryosparc_worker2/cryosparc_worker/cryosparc_compute/skcuda_internal/misc2.py", line 32, in <module>
    from . import cuda
  File "/home/cryspc/Cryosparc_worker2/cryosparc_worker/cryosparc_compute/skcuda_internal/cuda.py", line 17, in <module>
    from .cudadrv import *
  File "/home/cryspc/Cryosparc_worker2/cryosparc_worker/cryosparc_compute/skcuda_internal/cudadrv.py", line 39, in <module>
    raise OSError('CUDA driver library not found')
OSError: CUDA driver library not found
========= main process now complete.
========= monitor process now complete.
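
(Side note: whether the NVIDIA driver library is visible to the worker can be checked with something like the following; this is only a rough sketch, and the exact library location depends on how the driver was installed.)

# On the worker (quartet): confirm the NVIDIA driver and GPUs are visible
nvidia-smi
# Confirm the driver library libcuda.so is known to the dynamic linker
ldconfig -p | grep libcuda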

When I ran it from the master, this is the output:

/home/cryspc/Cryosparc_worker2/cryosparc_worker/bin/cryosparcw run --project P82 --job J32 --master_hostname quad.mshri.on.ca --master_command_core_port 39002
bash: /home/cryspc/Cryosparc_worker2/cryosparc_worker/bin/cryosparcw: No such file or directory
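
(The binary itself does exist on the worker, as the run above shows; from the master it can also be checked over SSH, which additionally exercises the password-less SSH access the master needs. The user and hostname below are taken from the paths in the scheduler log; adjust if yours differ.)

# Run on the master (quad); this should print the path without prompting for a password
ssh cryspc@quartet.mshri.on.ca ls /home/cryspc/Cryosparc_worker2/cryosparc_worker/bin/cryosparcw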

Hi @stephan, did you have a chance to look at this? We would really appreciate it if you could help us troubleshoot.
Thanks,
Samara

Hi @Smona,

Based on the launch command in the scheduler output, it doesn’t seem like this worker was connected correctly to the master instance. What arguments did you use when running the cryosparcw connect command?

Specifically, the lack of an SSH command around that launch line indicates the master node is trying to run the job on the same server, even though this worker is a remote machine.
See our guide for more information: https://guide.cryosparc.com/setup-configuration-and-management/how-to-download-install-and-configure/downloading-and-installing-cryosparc#connecting-a-worker-node
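
To illustrate, when a worker is registered as a remote node, the launch line in command_core would normally be wrapped in the worker’s SSH string, roughly like this (a sketch only; the exact form can differ between versions):

ssh cryspc@quartet.mshri.on.ca "nohup /home/cryspc/Cryosparc_worker2/cryosparc_worker/bin/cryosparcw run --project P82 --job J31 --master_hostname quad.mshri.on.ca --master_command_core_port 39002 > /Cry/CryoSparcV2/P82/J31/job.log 2>&1 &"

In your log the same command runs under a plain bash -c on the master itself, which would explain why the job never actually starts on quartet.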

Hi @stephan,
Thanks for your reply. I ran this command: ./bin/cryosparcw connect --worker quartet.mshri.on.ca --master quad.mshri.on.ca --port 39000 --ssdpath /ssd_cache --lane Quartet --sshstr cryspc@quad.mshri.on.ca

I also tried this one, with the worker sshstr, because I wasn’t sure which one I should use, but neither of them worked:
./bin/cryosparcw connect --worker quartet.mshri.on.ca --master quad.mshri.on.ca --port 39000 --ssdpath /ssd_cache --lane Quartet --sshstr cryspc@quartet.mshri.on.ca --update

Here is what I have currently:

 Updating target quartet.mshri.on.ca
  Current configuration:
               cache_path :  /ssd_cache
           cache_quota_mb :  None
         cache_reserve_mb :  10000
                     desc :  None
                     gpus :  [{'id': 0, 'mem': 12788105216, 'name': 'NVIDIA TITAN Xp'}, {'id': 1, 'mem': 12788498432, 'name': 'NVIDIA TITAN Xp'}, {'id': 2, 'mem': 12788498432, 'name': 'NVIDIA TITAN Xp'}, {'id': 3, 'mem': 12788498432, 'name': 'NVIDIA TITAN Xp'}]
                 hostname :  quartet.mshri.on.ca
                     lane :  Quartet
             monitor_port :  None
                     name :  quartet.mshri.on.ca
           resource_fixed :  {'SSD': True}
           resource_slots :  {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}
                  ssh_str :  cryspc@quad.mshri.on.ca
                    title :  Worker node quartet.mshri.on.ca
                     type :  node
          worker_bin_path :  /home/cryspc/Cryosparc_worker2/cryosparc_worker/bin/cryosparcw
 ---------------------------------------------------------------
  SSH connection string will be updated to cryspc@quartet.mshri.on.ca
  SSD will be enabled.
  Worker will be registered with SSD cache location /ssd_cache
  SSD path will be updated to /ssd_cache
  Worker will be reassigned to lane Quartet
 ---------------------------------------------------------------
  Updating..
  Done.
 ---------------------------------------------------------------
  Final configuration for quartet.mshri.on.ca
               cache_path :  /ssd_cache
           cache_quota_mb :  None
         cache_reserve_mb :  10000
                     desc :  None
                     gpus :  [{'id': 0, 'mem': 12788105216, 'name': 'NVIDIA TITAN Xp'}, {'id': 1, 'mem': 12788498432, 'name': 'NVIDIA TITAN Xp'}, {'id': 2, 'mem': 12788498432, 'name': 'NVIDIA TITAN Xp'}, {'id': 3, 'mem': 12788498432, 'name': 'NVIDIA TITAN Xp'}]
                 hostname :  quartet.mshri.on.ca
                     lane :  Quartet
             monitor_port :  None
                     name :  quartet.mshri.on.ca
           resource_fixed :  {'SSD': True}
           resource_slots :  {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}
                  ssh_str :  cryspc@quartet.mshri.on.ca
                    title :  Worker node quartet.mshri.on.ca
                     type :  node
          worker_bin_path :  /home/cryspc/Cryosparc_worker2/cryosparc_worker/bin/cryosparcw
 ---------------------------------------------------------------

Thanks, can you take a screenshot of the “Instance Information” tab in the Resource Manager?

@stephan here is the info:

Hi @stephan,
Hope you are doing fine. Here is an update:
I disconnected the lane and deleted the cryoSPARC installation on the new worker machine (quartet).
Then I installed the standalone package on quartet to check whether there was any issue with the machine itself. The installation succeeded, and I was able to create a project and run some jobs. So I guess the problem is the connection between the two machines.
Still, I don’t think this is the best setup for our lab, so I would like to check with you whether there is a way to connect the two standalone computers, or whether you have a hint about why one cannot connect to the other.

P.S. If you are fine with it, we could arrange a quick call to address this issue. Our whole lab would really appreciate it.

Thanks for your advice.
Samara

Hey @Smona,

No problem, let’s start from the beginning.
If you’re setting up cryoSPARC in the master-worker architecture, you need to ensure three things:
(More information is available in our guide; a quick way to verify each of these points is sketched after the list.)

  1. All nodes have access to a shared file system. This file system is where the project directories are located, allowing all nodes to read and write intermediate results as jobs start and complete.

  2. The master node has password-less SSH access to each of the worker nodes. SSH is used to execute jobs on the worker nodes from the master node.

  3. All worker nodes have TCP access to 10 consecutive ports on the master node (default ports are 39000-39010). These ports are used for metadata communication via HTTP Remote Procedure Call (RPC) based API requests.
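
A quick way to verify each of these three points from the command line looks roughly like this (a sketch only; the hostnames and paths follow the earlier posts, and the tools used are common defaults):

# 1. Shared file system: a file created in the project directory on the master
#    should be visible from the worker
touch /Cry/CryoSparcV2/test_shared_fs      # on quad (master)
ls -l /Cry/CryoSparcV2/test_shared_fs      # on quartet (worker)

# 2. Password-less SSH from the master to the worker
ssh-keygen -t rsa                          # on quad, only if no key exists yet
ssh-copy-id cryspc@quartet.mshri.on.ca     # on quad
ssh cryspc@quartet.mshri.on.ca hostname    # should print the worker hostname without a password prompt

# 3. TCP access from the worker back to the master's ports (default 39000-39010)
nc -zv quad.mshri.on.ca 39000              # on quartet; repeat for the other ports as needed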

Once those are confirmed, install cryoSPARC (master and worker) in a directory that is available on both instances (e.g., /u/cryosparcuser/cryosparc/cryosparc_master and /u/cryosparcuser/cryosparc/cryosparc_worker).
Now, you can connect the worker back to the main instance (by using the same ./cryosparcw connect command).
(Note that if you want to start the connection from scratch, you can remove a lane by running the command: cryosparcm cli "remove_scheduler_lane('Quartet')")
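
For example, with the prerequisites above in place, reconnecting quartet would look very much like the command you already ran, executed from inside the worker installation on quartet (note that --sshstr is the string the master uses to reach the worker):

./bin/cryosparcw connect --worker quartet.mshri.on.ca --master quad.mshri.on.ca --port 39000 --ssdpath /ssd_cache --lane Quartet --sshstr cryspc@quartet.mshri.on.ca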

Hi @stephan
Hope you are doing fine. Thanks a lot for the detailed explanation. I want to ask you a couple of questions about some issues I encountered.
At the moment, we have the master installation in a folder that is not accessible from the new worker. I think that is why the worker could not be connected properly. Therefore, we want to move the master installation to a shared location that the new worker can access.
What is the best way to move the installation to another location without messing up any of the existing projects? Could I just move the installation folders to the new location and then keep working on the projects without any issues?
Thanks a lot for your help.
Best,
Samara