Cryosparc worker issues

zahodnbd · December 8, 2022, 1:52pm

Dear CryoSPARC team,

I am running into an issue with my CryoSPARC installation (seems to be the worker) after upgrading to v4.0.x (originally 4.0.1 and now 4.0.3). Both worker and master are in the same version and both have the correct license ID in their config.sh files. I have also forced an override update for the cryosparc_worker and disconnected it and reconnected it from the master to no avail.

It appears that the jobs that require a worker (E.g. Extract from Micrographs) are hanging (originally they were hanging in the launched state, but after fixing the cuda path to >10 they are now hanging in the running state). Here is the current output of the Event Log of CryoSPARC, in which the job is hanging:

License is valid.

Launching job on lane default target em504-02.ibex.kaust.edu.sa ...

Running job on master node hostname em504-02.ibex.kaust.edu.sa

[CPU: 85.3 MB]
Job J142 Started

[CPU: 85.4 MB]
Master running v4.0.3, worker running v4.0.3

[CPU: 85.6 MB]
Working in directory: /ibex/scratch/projects/c2121/Brandon/cryosparc_datasets/P5/J142

[CPU: 85.6 MB]
Running on lane default

[CPU: 85.6 MB]
Resources allocated: 

[CPU: 85.6 MB]
  Worker:  em504-02.ibex.kaust.edu.sa

[CPU: 85.6 MB]
  CPU   :  [0, 1, 2, 3]

[CPU: 85.6 MB]
  GPU   :  [0, 1]

[CPU: 85.6 MB]
  RAM   :  [0]

[CPU: 85.6 MB]
  SSD   :  False

[CPU: 85.6 MB]
--------------------------------------------------------------

[CPU: 85.6 MB]
Importing job module for job type extract_micrographs_multi...

[CPU: 205.7 MB]
Job ready to run

[CPU: 205.7 MB]
***************************************************************

[CPU: 425.5 MB]
Particles do not have CTF estimates but micrographs do: micrograph CTFs will be recorded in particles output.

[CPU: 470.2 MB]
Collecting micrograph particle selection information...

[CPU: 720.8 MB]
Starting multithreaded pipeline ... 

[CPU: 721.0 MB]
Started pipeline

Testing the installation with “cryosparcm test install” proves successful:

Running installation tests...
✓ Running as cryoSPARC owner
✓ Running on master node
✓ CryoSPARC is running
✓ Connected to command_core at http://em504-02.ibex.kaust.edu.sa:39002
✓ CRYOSPARC_LICENSE_ID environment variable is set
✓ License has correct format
✓ Insecure mode is disabled
✓ License server set to “https://get.cryosparc.com”
✓ Connection to license server succeeded
✓ License server returned success status code 200
✓ License server returned valid JSON response
✓ License exists and is valid
✓ CryoSPARC is running v4.0.3
✓ Running the latest version of CryoSPARC
Could not get latest patch (status code 404)
✓ Patch update not required
✓ Admin user has been created
✓ GPU worker connected.

But testing the workers prove unsuccessful (we are running the master + worker on the same node):

[zahodnbd@em504-02 ~]$ cryosparcm test workers P10
Using project P10
Running worker tests...
2022-12-08 16:06:00,532 WORKER_TEST          log                  CRITICAL | Worker test results
2022-12-08 16:06:00,533 WORKER_TEST          log                  CRITICAL | em504-02.ibex.kaust.edu.sa
2022-12-08 16:06:00,533 WORKER_TEST          log                  CRITICAL |   ✕ LAUNCH
2022-12-08 16:06:00,533 WORKER_TEST          log                  CRITICAL |     Error: 
2022-12-08 16:06:00,533 WORKER_TEST          log                  CRITICAL |     See P10 J5 for more information
2022-12-08 16:06:00,533 WORKER_TEST          log                  CRITICAL |   ⚠ SSD
2022-12-08 16:06:00,533 WORKER_TEST          log                  CRITICAL |     Did not run: Launch test failed
2022-12-08 16:06:00,533 WORKER_TEST          log                  CRITICAL |   ⚠ GPU
2022-12-08 16:06:00,533 WORKER_TEST          log                  CRITICAL |     Did not run: Launch test failed

The event log from P10/J5 can be found below, where there is a strange error that says it was killed by an unknown user:

License is valid.

Launching job on lane default target em504-02.ibex.kaust.edu.sa ...

Running job on master node hostname em504-02.ibex.kaust.edu.sa

**** Kill signal sent by unknown user ****

[CPU: 82.0 MB]
Job J5 Started

[CPU: 82.0 MB]
Master running v4.0.3, worker running v4.0.3

[CPU: 82.0 MB]
Working in directory: /ibex/scratch/projects/c2121/Brandon/cryosparc_datasets/CS-test-project/J5

[CPU: 82.0 MB]
Running on lane default

[CPU: 82.0 MB]
Resources allocated: 

[CPU: 82.0 MB]
  Worker:  em504-02.ibex.kaust.edu.sa

[CPU: 82.0 MB]
  CPU   :  [8]

[CPU: 82.0 MB]
  GPU   :  []

[CPU: 82.0 MB]
  RAM   :  [2]

[CPU: 82.0 MB]
  SSD   :  False

[CPU: 82.0 MB]
--------------------------------------------------------------

[CPU: 82.0 MB]
Importing job module for job type instance_launch_test...

[CPU: 190.4 MB]
Job ready to run

[CPU: 190.4 MB]
***************************************************************

[CPU: 190.5 MB]
Job successfully running

[CPU: 190.5 MB]
--------------------------------------------------------------

[CPU: 190.5 MB]
Compiling job outputs...

[CPU: 190.5 MB]
Updating job size...

[CPU: 190.5 MB]
Exporting job and creating csg files...

[CPU: 190.5 MB]
***************************************************************

[CPU: 190.5 MB]
Job complete. Total time 3.82s

CryoSPARC is installed on an isolated single node on our cluster, with 8GPU 40 CPUs, and was successfully running prior to the upgrade to 4.0. Our IT support team cannot locate the issues.

From the last working state of CryoSPARC: I created a backup of my database, upgraded to v4.0.1, then faced an issue that my jobs were stuck in the “launched” state. So, I deleted CryoSPARC and installed fresh to v4.0.3, where I then restored the database from the backup created prior to the upgrade. After this, I can run jobs that only run on the master (Import Micrographs) but face the above issue.
Additionally, due to the reinstallation of CryoSPARC, I also faced another error (below) in which there is a mismatch of expected instance ID’s, however in the examples above, P10 is a newly created project directory from this proper installation instance, which made me think this error is not contributing to the issue with the hanging job.

Unable to detach P1: ServerError: validation error: instance id mismatch for P1. Expected 82432f20-b84a-4f7c-a4f2-ccb135a8f658, actual e1b5b125-f99e-44e9-adb2-7bb6c91e1b60. This indicates that P1 was attached to another cryoSPARC instance without detaching from this one.

Any help is appreciated! Thank you!

cryoSPARC instance information

Type: master-worker
Software version: v4.0.3 from cryosparcm status

[zahodnbd@em504-02 ~]$ uname -a && free -g
Linux em504-02 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
              total        used        free      shared  buff/cache   available
Mem:            376           8         358           7           9         359
Swap:            29           0          29

CryoSPARC worker environment
Cuda Toolkit Path: /sw/csgv/cuda/11.2.2/el7.9_binary/
Pycuda information:

[zahodnbd@em504-02 ~]$ python -c "import pycuda.driver; print(pycuda.driver.get_version())"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/ibex/scratch/projects/c2121/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/driver.py", line 62, in <module>
    from pycuda._driver import *  # noqa
ImportError: libcurand.so.9.0: cannot open shared object file: No such file or directory

[zahodnbd@em504-02 ~]$ uname -a && free -g && nvidia-smi
Linux em504-02 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
              total        used        free      shared  buff/cache   available
Mem:            376           8         358           7           9         359
Swap:            29           0          29
Thu Dec  8 18:40:23 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:1A:00.0 Off |                  N/A |
| 29%   25C    P8    10W / 250W |      1MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:1C:00.0 Off |                  N/A |
| 30%   28C    P8    10W / 250W |      1MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:1D:00.0 Off |                  N/A |
| 30%   27C    P8    17W / 250W |      1MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  On   | 00000000:1E:00.0 Off |                  N/A |
| 31%   27C    P8    18W / 250W |      1MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  On   | 00000000:3D:00.0 Off |                  N/A |
| 30%   24C    P8     4W / 250W |      1MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  On   | 00000000:3F:00.0 Off |                  N/A |
| 30%   26C    P8    26W / 250W |      1MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  On   | 00000000:40:00.0 Off |                  N/A |
| 31%   25C    P8     2W / 250W |      1MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  On   | 00000000:41:00.0 Off |                  N/A |
| 30%   26C    P8    10W / 250W |      1MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The pycuda information is surprising, as I also changed the cuda path to a different path, and it had mentioned it was successful in redownloading pycuda.

wtempel · December 8, 2022, 6:01pm

Is the CryoSPARC instance, which runs on a cluster node, subject to workload management by some cluster software like gridengine, slurm, or do you access em504-02 directly via ssh, to run commands like
cryosparcm start?
Please can you run

/ibex/scratch/projects/c2121/cryosparc/cryosparc_worker/bin/cryosparcw newcuda /sw/csgv/cuda/11.2.2/el7.9_binary 2>&1 | tee newcuda_$(date +%F).out

(any errors?) and then

/ibex/scratch/projects/c2121/cryosparc/cryosparc_worker/bin/cryosparcw call which nvcc

and

/ibex/scratch/projects/c2121/cryosparc/cryosparc_worker/bin/cryosparcw call python -c "import pycuda.driver; print(pycuda.driver.get_version())"

zahodnbd · December 9, 2022, 3:57am

Hi! Thanks for your reply.

The CryoSPARC instance is not subject to any workload management at the moment. I access it directly via ssh to run commands. It was installed with the master and worker separately, but following the installation workflow as indicated on the guide.

First Command:

[zahodnbd@em504-02 ~]$ /ibex/scratch/projects/c2121/cryosparc/cryosparc_worker/bin/cryosparcw newcuda /sw/csgv/cuda/11.2.2/el7.9_binary 2>&1 | tee newcuda_$(date +%F).out
 New CUDA Path was provided as /sw/csgv/cuda/11.2.2/el7.9_binary
 Checking CUDA installation...
 Found nvcc at /sw/csgv/cuda/11.2.2/el7.9_binary/bin/nvcc
 The above cuda installation will be used but can be changed later.
 Proceeding to uninstall pycuda...
Found existing installation: pycuda 2020.1
Uninstalling pycuda-2020.1:
  Would remove:
    /ibex/scratch/projects/c2121/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda-2020.1.dist-info/*
    /ibex/scratch/projects/c2121/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/*
Proceed (Y/n)? Y
  Successfully uninstalled pycuda-2020.1
 Replacing path within config.sh...
 Proceeding to recompile pycuda with the new CUDA binaries
Processing ./deps_bundle/python/python_packages/pip_packages/pycuda-2020.1.tar.gz
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: pytools>=2011.2 in ./deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages (from pycuda==2020.1) (2020.4.4)
Requirement already satisfied: decorator>=3.2.0 in ./deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages (from pycuda==2020.1) (4.4.2)
Requirement already satisfied: appdirs>=1.4.0 in ./deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages (from pycuda==2020.1) (1.4.4)
Requirement already satisfied: mako in ./deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages (from pycuda==2020.1) (1.1.6)
Requirement already satisfied: numpy>=1.6.0 in ./deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages (from pytools>=2011.2->pycuda==2020.1) (1.19.5)
Requirement already satisfied: MarkupSafe>=0.9.2 in ./deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages (from mako->pycuda==2020.1) (2.0.1)
Building wheels for collected packages: pycuda
  Building wheel for pycuda (setup.py): started
  Building wheel for pycuda (setup.py): still running...
  Building wheel for pycuda (setup.py): still running...
  Building wheel for pycuda (setup.py): finished with status 'done'
  Created wheel for pycuda: filename=pycuda-2020.1-cp37-cp37m-linux_x86_64.whl size=641415 sha256=f8967eacdfdd4a9e12a335829266632034c4cf4d9e36c2f7e0bc0a0d40650f71
  Stored in directory: /tmp/pip-ephem-wheel-cache-edbvf2ze/wheels/34/51/ce/6939748c9e70aa429cf0bfa5c4c26abc3fc0707e82baf4da0b
Successfully built pycuda
Installing collected packages: pycuda
Successfully installed pycuda-2020.1
 Finished!

Second Command:

[zahodnbd@em504-02 ~]$ /ibex/scratch/projects/c2121/cryosparc/cryosparc_worker/bin/cryosparcw call which nvcc
/sw/csgv/cuda/11.2.2/el7.9_binary/bin/nvcc

Third Command:

[zahodnbd@em504-02 ~]$ /ibex/scratch/projects/c2121/cryosparc/cryosparc_worker/bin/cryosparcw call python -c "import pycuda.driver; print(pycuda.driver.get_version())"
(11, 2, 0)

zahodnbd · December 9, 2022, 4:04am

Running the CryoSPARC Worker environment steps again proves successful now (?)

[zahodnbd@em504-02 ~]$ eval $(/ibex/scratch/projects/c2121/cryosparc/cryosparc_worker/bin/cryosparcw env)
[zahodnbd@em504-02 ~]$ echo $CRYOSPARC_CUDA_PATH
/sw/csgv/cuda/11.2.2/el7.9_binary
[zahodnbd@em504-02 ~]$ ${CRYOSPARC_CUDA_PATH}/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
[zahodnbd@em504-02 ~]$ python -c "import pycuda.driver; print(pycuda.driver.get_version())"
(11, 2, 0)
[zahodnbd@em504-02 ~]$ uname -a && free -g && nvidia-smi
Linux em504-02 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
              total        used        free      shared  buff/cache   available
Mem:            376           8         357           7          10         358
Swap:            29           0          29
Fri Dec  9 07:01:02 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:1A:00.0 Off |                  N/A |
| 29%   25C    P8    10W / 250W |      1MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:1C:00.0 Off |                  N/A |
| 30%   28C    P8    10W / 250W |      1MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:1D:00.0 Off |                  N/A |
| 30%   27C    P8    17W / 250W |      1MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  On   | 00000000:1E:00.0 Off |                  N/A |
| 30%   27C    P8    18W / 250W |      1MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  On   | 00000000:3D:00.0 Off |                  N/A |
| 30%   24C    P8     4W / 250W |      1MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  On   | 00000000:3F:00.0 Off |                  N/A |
| 30%   26C    P8    26W / 250W |      1MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  On   | 00000000:40:00.0 Off |                  N/A |
| 31%   24C    P8     3W / 250W |      1MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  On   | 00000000:41:00.0 Off |                  N/A |
| 30%   26C    P8    12W / 250W |      1MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Testing the Cryosparc workers fails still, though:

[zahodnbd@em504-02 ~]$ cryosparcm test workers P10
Using project P10
Running worker tests...
2022-12-09 07:14:41,042 WORKER_TEST          log                  CRITICAL | Worker test results
2022-12-09 07:14:41,042 WORKER_TEST          log                  CRITICAL | em504-02.ibex.kaust.edu.sa
2022-12-09 07:14:41,042 WORKER_TEST          log                  CRITICAL |   ✕ LAUNCH
2022-12-09 07:14:41,042 WORKER_TEST          log                  CRITICAL |     Error: 
2022-12-09 07:14:41,042 WORKER_TEST          log                  CRITICAL |     See P10 J7 for more information
2022-12-09 07:14:41,042 WORKER_TEST          log                  CRITICAL |   ⚠ SSD
2022-12-09 07:14:41,043 WORKER_TEST          log                  CRITICAL |     Did not run: Launch test failed
2022-12-09 07:14:41,043 WORKER_TEST          log                  CRITICAL |   ⚠ GPU
2022-12-09 07:14:41,043 WORKER_TEST          log                  CRITICAL |     Did not run: Launch test failed

P10 J7 Event Log from the dashboard:

License is valid.

Launching job on lane default target em504-02.ibex.kaust.edu.sa ...

Running job on master node hostname em504-02.ibex.kaust.edu.sa

**** Kill signal sent by unknown user ****

[CPU: 84.2 MB]
Job J7 Started

[CPU: 84.3 MB]
Master running v4.0.3, worker running v4.0.3

[CPU: 84.3 MB]
Working in directory: /ibex/scratch/projects/c2121/Brandon/cryosparc_datasets/CS-test-project/J7

[CPU: 84.3 MB]
Running on lane default

[CPU: 84.3 MB]
Resources allocated: 

[CPU: 84.3 MB]
  Worker:  em504-02.ibex.kaust.edu.sa

[CPU: 84.3 MB]
  CPU   :  [4]

[CPU: 84.3 MB]
  GPU   :  []

[CPU: 84.3 MB]
  RAM   :  [1]

[CPU: 84.3 MB]
  SSD   :  False

[CPU: 84.3 MB]
--------------------------------------------------------------

[CPU: 84.3 MB]
Importing job module for job type instance_launch_test...

[CPU: 190.6 MB]
Job ready to run

[CPU: 190.7 MB]
***************************************************************

[CPU: 190.7 MB]
Job successfully running

[CPU: 190.7 MB]
--------------------------------------------------------------

[CPU: 190.7 MB]
Compiling job outputs...

[CPU: 190.7 MB]
Updating job size...

[CPU: 190.7 MB]
Exporting job and creating csg files...

[CPU: 190.7 MB]
***************************************************************

[CPU: 190.7 MB]
Job complete. Total time 0.44s

P10 J7 job.log file:

[zahodnbd@em504-02 J7]$ cat job.log


================= CRYOSPARCW =======  2022-12-09 07:14:48.234614  =========
Project P10 Job J7
Master em504-02.ibex.kaust.edu.sa Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 58822
========= monitor process now waiting for main process
MAIN PID 58822
instance_testing.run cryosparc_compute.jobs.jobregister
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
***************************************************************
***************************************************************
========= main process now complete.
========= monitor process now complete.

zahodnbd · December 9, 2022, 11:46am

Update – My command “cryosparcm test workers P10” still fails as above; however, my jobs (Import Micrograph and Extract with a small subset) appear to now work…

I am not sure if the worker error will pose issues later or not…

wtempel · December 9, 2022, 2:30pm

Have you also tried a GPU-based job like 2D classification?

Please can you post the output of the commands
cryosparcm cli "get_scheduler_targets()"
and
nvidia-smi -q -d compute

zahodnbd · December 9, 2022, 8:28pm

I have just started a 2D classification job with the output of the small extracted particle set, and it is currently running (with default parameters + 4 GPUs). It appears to be working…

Below is the output from the commands you suggested:

zahodnbd@em504-02 ~]$ cryosparcm cli "get_scheduler_targets()"
[{'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11554848768, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11554848768, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 2, 'mem': 11554848768, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 3, 'mem': 11554848768, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 4, 'mem': 11554848768, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 5, 'mem': 11554848768, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 6, 'mem': 11554848768, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 7, 'mem': 11554848768, 'name': 'NVIDIA GeForce RTX 2080 Ti'}], 'hostname': 'em504-02.ibex.kaust.edu.sa', 'lane': 'default', 'monitor_port': None, 'name': 'em504-02.ibex.kaust.edu.sa', 'resource_fixed': {'SSD': False}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39], 'GPU': [0, 1, 2, 3, 4, 5, 6, 7], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]}, 'ssh_str': 'zahodnbd@em504-02.ibex.kaust.edu.sa', 'title': 'Worker node em504-02.ibex.kaust.edu.sa', 'type': 'node', 'worker_bin_path': '/ibex/scratch/projects/c2121/cryosparc/cryosparc_worker/bin/cryosparcw'}]
[zahodnbd@em504-02 ~]$ nvidia-smi -q -d compute

==============NVSMI LOG==============

Timestamp                                 : Fri Dec  9 23:25:13 2022
Driver Version                            : 520.61.05
CUDA Version                              : 11.8

Attached GPUs                             : 8
GPU 00000000:1A:00.0
    Compute Mode                          : Default

GPU 00000000:1C:00.0
    Compute Mode                          : Default

GPU 00000000:1D:00.0
    Compute Mode                          : Default

GPU 00000000:1E:00.0
    Compute Mode                          : Default

GPU 00000000:3D:00.0
    Compute Mode                          : Default

GPU 00000000:3F:00.0
    Compute Mode                          : Default

GPU 00000000:40:00.0
    Compute Mode                          : Default

GPU 00000000:41:00.0
    Compute Mode                          : Default

zahodnbd · December 9, 2022, 8:35pm

2D classification of this small particle set was completed successfully with 4 GPUs… I guess it is working? I will update the thread if I encounter additional errors.

Also, please let me know if you would like me to run any additional commands to assess any causes of this issue (and continued failure of the cryosparc test workers command…) [There is quite a time difference, but I will update ASAP]

wtempel · December 16, 2022, 9:00pm

zahodnbd:

Additionally, due to the reinstallation of CryoSPARC, I also faced another error (below) in which there is a mismatch of expected instance ID’s, however in the examples above, P10 is a newly created project directory from this proper installation instance, which made me think this error is not contributing to the issue with the hanging job.
Unable to detach P1: ServerError: validation error: instance id mismatch for P1. Expected 82432f20-b84a-4f7c-a4f2-ccb135a8f658, actual e1b5b125-f99e-44e9-adb2-7bb6c91e1b60. This indicates that P1 was attached to another cryoSPARC instance without detaching from this one.

This may indeed be unrelated to the cryosparcm test workers errors. For a possible resolution, please see Instance ID mismatch

wtempel · December 20, 2022, 5:39pm

We have just released CryoSPARC v4.1.1. Would you like to check if this software update resolves these
cryosparcm test workers errors on your system?

zahodnbd · January 22, 2023, 1:23pm

Hi, Sorry I just finally got around to updating CryoSPARC following the holidays.

Unfortunately, it seems like the update to v4.1.1 and the further patch +230110 did not solve the issue with the cryosparcm test workers errors

[zahodnbd@em504-02 cryosparc_master]$ cryosparcm test install
Running installation tests...
? Running as cryoSPARC owner
? Running on master node
? CryoSPARC is running
? Connected to command_core at http://em504-02.ibex.kaust.edu.sa:39002
? CRYOSPARC_LICENSE_ID environment variable is set
? License has correct format
? Insecure mode is disabled
? License server set to "https://get.cryosparc.com"
? Connection to license server succeeded
? License server returned success status code 200
? License server returned valid JSON response
? License exists and is valid
? CryoSPARC is running v4.1.1+230110
? Running the latest version of CryoSPARC
? Patch update not required
? Admin user has been created
? GPU worker connected.

and

zahodnbd@em504-02 cryosparc_master]$ cryosparcm test w P10
Using project P10
Running worker tests...
2023-01-22 16:05:03,574 WORKER_TEST          log                  CRITICAL | Worker test results
2023-01-22 16:05:03,574 WORKER_TEST          log                  CRITICAL | em504-02.ibex.kaust.edu.sa
2023-01-22 16:05:03,574 WORKER_TEST          log                  CRITICAL |   ? LAUNCH
2023-01-22 16:05:03,574 WORKER_TEST          log                  CRITICAL |     No SSD available
2023-01-22 16:05:03,574 WORKER_TEST          log                  CRITICAL |   ? SSD
2023-01-22 16:05:03,574 WORKER_TEST          log                  CRITICAL |     Did not run: Launch test failed
2023-01-22 16:05:03,574 WORKER_TEST          log                  CRITICAL |   ? GPU
2023-01-22 16:05:03,574 WORKER_TEST          log                  CRITICAL |     Did not run: Launch test failed

wtempel · January 22, 2023, 9:08pm

Please can you post event and job logs for this latest
cryosparcm test w P10 attempt.
Are you still able to run 2D classification or other GPU-enabled job types?

zahodnbd · January 23, 2023, 6:56am

Are you still able to run 2D classification or other GPU-enabled job types?

I was still able to run a GPU-enabled job type. I had cloned a previous 2D classification job that used 2 GPU’s and had previously completed in 1h 21m and the cloned job ran successfully; however, it took 4h 33m to complete…

Please can you post event and job logs for this latest
cryosparcm test w P10 attempt.

These latest attempts did not produce event or job logs, as the workspace it created is empty. Comparing it against the previously failed worker tests, this one failed on launch due to “no SSD available” whereas the last one appears to have passed that test.

This is a photo from the 2 recent attempts W6 & W7, along with the previous attempts in December

For v4.1.1, I simply updated and patched cryosparc using cryosparcm update and cryosparcm patch successively after ssh’ing onto the master node. Not sure if that would have skipped the --nossd flag that I included during installation…