CryoSPARC v4 hangs when running GPU jobs

We just upgraded to CryoSPARC v4, and there appears to be a problem when running jobs on the GPU: jobs such as heterogeneous refinement and non-uniform refinement hang and never start. As a control, we tested a CPU-based job (Select 2D Classes), which finishes successfully and uploads to the database.

In the event log in the web interface, the only information displayed is:

[2022-10-16 21:41:58.05]. License is valid.
[2022-10-16 21:41:58.05]. Launching job on lane default target infinity.salk.edu ...
[2022-10-16 21:41:58.08]. Running job on master node hostname infinity.salk.edu

Nothing further is logged after that.
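In case it's useful for diagnosing this, we can also pull the worker-side output, which never appears in the web event log. Below are the documented cryosparcm commands we would run on the master node (P28/J135 are the project and job UIDs from the logs in this post):

```shell
# Tail the per-job worker log for the stuck job; this is where GPU
# initialization errors usually surface rather than in the event log.
cryosparcm joblog P28 J135

# Follow the command_core service log directly:
cryosparcm log command_core
```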

cryosparc_master/run/command_core.log nevertheless indicates that the job finished; the tail of the log file is pasted below:

2022-10-16 21:44:28,008 COMMAND.JOBS         set_job_status       INFO     | Status changed for P28.J135 from waiting to running
2022-10-16 21:44:41,084 COMMAND.DATA         dump_job_database    INFO     | Request to export P28 J135
2022-10-16 21:44:41,086 COMMAND.DATA         dump_job_database    INFO     |    Exporting job to /log-l/netapp/data5/zshan/cryoSPARC/19jan04c/P28/J135
2022-10-16 21:44:41,088 COMMAND.DATA         dump_job_database    INFO     |    Exporting all of job's images in the database to /log-l/netapp/data5/zshan/cryoSPARC/19jan04c/P28/J135/gridfs_data...
2022-10-16 21:44:41,152 COMMAND.DATA         dump_job_database    INFO     |    Writing 59 database images to /log-l/netapp/data5/zshan/cryoSPARC/19jan04c/P28/J135/gridfs_data/gridfsdata_0
2022-10-16 21:44:41,153 COMMAND.DATA         dump_job_database    INFO     |    Done. Exported 59 images in 0.06s
2022-10-16 21:44:41,153 COMMAND.DATA         dump_job_database    INFO     |    Exporting all job's streamlog events...
2022-10-16 21:44:41,156 COMMAND.DATA         dump_job_database    INFO     |    Done. Exported 1 files in 0.00s
2022-10-16 21:44:41,156 COMMAND.DATA         dump_job_database    INFO     |    Exporting job metafile...
2022-10-16 21:44:41,158 COMMAND.DATA         dump_job_database    INFO     |    Creating .csg file for particles_selected
2022-10-16 21:44:41,169 COMMAND.DATA         dump_job_database    INFO     |    Creating .csg file for templates_selected
2022-10-16 21:44:41,179 COMMAND.DATA         dump_job_database    INFO     |    Creating .csg file for particles_excluded
2022-10-16 21:44:41,188 COMMAND.DATA         dump_job_database    INFO     |    Creating .csg file for templates_excluded
2022-10-16 21:44:41,210 COMMAND.DATA         dump_job_database    INFO     |    Done. Exported in 0.05s
2022-10-16 21:44:41,210 COMMAND.DATA         dump_job_database    INFO     |    Updating job manifest...
2022-10-16 21:44:41,214 COMMAND.DATA         dump_job_database    INFO     |    Done. Updated in 0.00s
2022-10-16 21:44:41,214 COMMAND.DATA         dump_job_database    INFO     | Exported P28 J135 in 0.13s
2022-10-16 21:44:41,231 COMMAND.JOBS         set_job_status       INFO     | Status changed for P28.J135 from running to completed

cryosparc_master/run/command_vis.log shows an out-of-range index error; the tail of the log file is pasted below:

2022-10-16 18:35:11,733 VIS.MAIN             recreate_mesh        INFO     | Loading mesh for P28 J133.volume.map_sharp
[2022-10-16 18:35:11,742] ERROR in app: Exception on /P28/J133.volume.map_sharp [GET]
Traceback (most recent call last):
  File "/home/cryospuser/cryosparc2/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/cryospuser/cryosparc2/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/cryospuser/cryosparc2/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/home/cryospuser/cryosparc2/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/home/cryospuser/cryosparc2/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/cryospuser/cryosparc2/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/cryospuser/cryosparc2/cryosparc_master/cryosparc_command/command_vis/__init__.py", line 142, in recreate_mesh
    result = cli.get_job_result(project_uid, src_result) # only gives one version and metafiles
  File "/home/cryospuser/cryosparc2/cryosparc_master/cryosparc_compute/client.py", line 66, in func
    + self._format_server_error(res['error'])
AssertionError: Encountered error for method "get_job_result" with params ('P28', 'J133.volume.map_sharp'):
ServerError: list index out of range
Traceback (most recent call last):
  File "/home/cryospuser/cryosparc2/cryosparc_master/cryosparc_command/commandcommon.py", line 194, in wrapper
    res = func(*args, **kwargs)
  File "/home/cryospuser/cryosparc2/cryosparc_master/cryosparc_command/command_core/__init__.py", line 6316, in get_job_result
    output_result['version']  = output_result['versions'][idx]
IndexError: list index out of range
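To illustrate what the traceback reduces to: `get_job_result` indexes into a result's `versions` list without checking its length, so a result whose `versions` list is empty (or shorter than `idx`) raises the `IndexError` above. A minimal sketch of the failure and a guarded lookup — the dict layout and function name here are illustrative, not CryoSPARC's actual data model:

```python
def get_result_version(output_result, idx):
    """Return the requested version, or None when idx is out of range.

    The unguarded equivalent of the failing line is:
        output_result['version'] = output_result['versions'][idx]
    which raises IndexError whenever idx >= len(versions).
    """
    versions = output_result.get('versions', [])
    if not (0 <= idx < len(versions)):
        return None  # skipping this bounds check reproduces the logged error
    return versions[idx]

# A result whose 'versions' list is empty, as appears to be the case for
# J133.volume.map_sharp after the upgrade:
stale_result = {'versions': [], 'metafiles': []}
print(get_result_version(stale_result, 0))  # None instead of IndexError
```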

The CUDA and driver information is:

[dlyumkis@infinity cryosparc_master]$ nvidia-smi
Sun Oct 16 21:49:14 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   32C    P0    26W / 250W |      4MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
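Since nvidia-smi shows a healthy driver with no running processes, we could also confirm that the CryoSPARC worker environment itself can enumerate the GPUs; `gpulist` is the documented check in the worker's cryosparcw CLI:

```shell
# Run from the cryosparc_worker installation directory on the GPU node;
# this lists the devices the worker environment can actually see.
bin/cryosparcw gpulist
```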

Any help would be appreciated.

Dmitry

Could you please email us the failed job’s error report?