I previously had a workstation with 3x 2080 Tis and Linux Mint 20 (Ulyana) installed on a 1 TB SSD. I took the SSD out of that workstation and installed it in another workstation, which has 1x 2080 Ti and 1x 780 Ti (used only for display). The OS booted up fine and has been running almost all of the software I used previously. The exception is cryoSPARC. I have tried running both NU-R (non-uniform refinement) and homogeneous refinement jobs. Both fail while calculating FSCs. It seems like the cryoSPARC instance is not able to connect to the GPU.
Troubleshooting steps I’ve tried:
- Uninstalled and reinstalled CUDA (10.2) without errors
- Uninstalled and reinstalled cryosparc without errors
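One more check that could be scripted is whether the driver still lists both cards after the SSD move. The following is a minimal Python 3 sketch, assuming `nvidia-smi` is on the PATH; the helper names `parse_gpu_csv` and `query_gpus` are made up for illustration, and the memory figures in `EXAMPLE` are illustrative values rounded from the byte counts in the job log, not real `nvidia-smi` output.

```python
import subprocess

# Standard nvidia-smi query: one CSV row per GPU, no header line.
QUERY = ["nvidia-smi", "--query-gpu=index,name,memory.total", "--format=csv,noheader"]


def parse_gpu_csv(text):
    """Turn 'index, name, memory' CSV rows into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skip anything that is not a 3-column row
        index, name, memory = parts
        gpus.append({"index": int(index), "name": name, "memory": memory})
    return gpus


def query_gpus():
    """Ask the driver which GPUs it sees; return [] if nvidia-smi is unavailable."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_csv(result.stdout)


# Illustrative rows only: GPU names come from the job log, memory values are rounded guesses.
EXAMPLE = "0, GeForce RTX 2080 Ti, 11019 MiB\n1, GeForce GTX 780 Ti, 3021 MiB\n"

if __name__ == "__main__":
    # Fall back to the illustrative rows when no driver is available.
    for gpu in query_gpus() or parse_gpu_csv(EXAMPLE):
        print(gpu["index"], gpu["name"], gpu["memory"])
```

Note that `nvidia-smi` lists devices in PCI bus order, while CUDA applications default to a fastest-first ordering unless `CUDA_DEVICE_ORDER=PCI_BUS_ID` is set, so the index printed here may not be the same card that cryoSPARC calls GPU 0.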
Output from cryosparcm log command_core
> [IMPORT_PROJECT] : Uploading project image data...
> [IMPORT_PROJECT] : Done. Uploaded 0 files in 0.00s
> [IMPORT_PROJECT] : Inserted job document in 0.00s...
> [IMPORT_PROJECT] : Inserting streamlogs into jobs...
> [IMPORT_PROJECT] : Done. Inserted 0 streamlogs in 0.00s...
> [IMPORT_PROJECT] : Imported J27 into P1 in 0.00s...
> [IMPORT_PROJECT] : Uploading project image data...
> [IMPORT_PROJECT] : Done. Uploaded 0 files in 0.00s
> [IMPORT_PROJECT] : Inserted job document in 0.01s...
> [IMPORT_PROJECT] : Inserting streamlogs into jobs...
> [IMPORT_PROJECT] : Done. Inserted 0 streamlogs in 0.00s...
> [IMPORT_PROJECT] : Imported J3 into P1 in 0.01s...
> [IMPORT_PROJECT] : Uploading project image data...
> [IMPORT_PROJECT] : Done. Uploaded 0 files in 0.00s
> [IMPORT_PROJECT] : Inserted job document in 0.00s...
> [IMPORT_PROJECT] : Inserting streamlogs into jobs...
> [IMPORT_PROJECT] : Done. Inserted 0 streamlogs in 0.00s...
> [IMPORT_PROJECT] : Imported J4 into P1 in 0.00s...
> [IMPORT_PROJECT] : Uploading project image data...
> [IMPORT_PROJECT] : Done. Uploaded 4 files in 0.03s
> [IMPORT_PROJECT] : Inserted job document in 0.04s...
> [IMPORT_PROJECT] : Inserting streamlogs into jobs...
> [IMPORT_PROJECT] : Done. Inserted 25 streamlogs in 0.03s...
> [IMPORT_PROJECT] : Imported J5 into P1 in 0.07s...
> [IMPORT_PROJECT] : Uploading project image data...
> [IMPORT_PROJECT] : Done. Uploaded 8 files in 0.04s
> [IMPORT_PROJECT] : Inserted job document in 0.05s...
> [IMPORT_PROJECT] : Inserting streamlogs into jobs...
> [IMPORT_PROJECT] : Done. Inserted 28 streamlogs in 0.02s...
> [IMPORT_PROJECT] : Imported J6 into P1 in 0.07s...
> [IMPORT_PROJECT] : Uploading project image data...
> [IMPORT_PROJECT] : Done. Uploaded 8 files in 0.04s
> [IMPORT_PROJECT] : Inserted job document in 0.05s...
> [IMPORT_PROJECT] : Inserting streamlogs into jobs...
> [IMPORT_PROJECT] : Done. Inserted 28 streamlogs in 0.02s...
> [IMPORT_PROJECT] : Imported J7 into P1 in 0.07s...
> [IMPORT_PROJECT] : Uploading project image data...
> [IMPORT_PROJECT] : Done. Uploaded 8 files in 0.03s
> [IMPORT_PROJECT] : Inserted job document in 0.05s...
> [IMPORT_PROJECT] : Inserting streamlogs into jobs...
> [IMPORT_PROJECT] : Done. Inserted 28 streamlogs in 0.03s...
> [IMPORT_PROJECT] : Imported J8 into P1 in 0.07s...
> [IMPORT_PROJECT] : Uploading project image data...
> [IMPORT_PROJECT] : Done. Uploaded 8 files in 0.05s
> [IMPORT_PROJECT] : Inserted job document in 0.06s...
> [IMPORT_PROJECT] : Inserting streamlogs into jobs...
> [IMPORT_PROJECT] : Done. Inserted 28 streamlogs in 0.01s...
> [IMPORT_PROJECT] : Imported J9 into P1 in 0.07s...
> [IMPORT_PROJECT] : Imported project from /mnt/12T_HDD2/XX/P12 as P1 in 4.03s
> [EXPORT_PROJECT] : Exporting project P1...
> [EXPORT_PROJECT] : Exported project P1 to /mnt/12T_HDD2/XX/P12/project.json in 0.00s
> ---- Deleting project UID P1 job UID J27
> Now clearing job..
> [EXPORT_JOB] : Request to export P1 J27
> [EXPORT_JOB] : Exporting job to /mnt/12T_HDD2/XX/P12/J27
> [EXPORT_JOB] : Exporting all of job's images in the database to /mnt/12T_HDD2/XX/P12/J27/gridfs_data...
> [EXPORT_JOB] : Done. Exported 0 images in 0.00s
> [EXPORT_JOB] : Exporting all job's streamlog events...
> [EXPORT_JOB] : Done. Exported 1 files in 0.00s
> [EXPORT_JOB] : Exporting job metafile...
> [EXPORT_JOB] : Done. Exported in 0.01s
> [EXPORT_JOB] : Updating job manifest...
> [EXPORT_JOB] : Done. Updated in 0.00s
> [EXPORT_JOB] : Exported P1 J27 in 0.03s
> [EXPORT_PROJECT] : Exporting project P1...
> [EXPORT_PROJECT] : Exported project P1 to /mnt/12T_HDD2/XX/P12/project.json in 0.00s
> [EXPORT_JOB] : Request to export P1 J28
> [EXPORT_JOB] : Exporting job to /mnt/12T_HDD2/XX/P12/J28
> [EXPORT_JOB] : Exporting all of job's images in the database to /mnt/12T_HDD2/XX/P12/J28/gridfs_data...
> [EXPORT_JOB] : Done. Exported 0 images in 0.00s
> [EXPORT_JOB] : Exporting all job's streamlog events...
> [EXPORT_JOB] : Done. Exported 1 files in 0.00s
> [EXPORT_JOB] : Exporting job metafile...
> [EXPORT_JOB] : Done. Exported in 0.00s
> [EXPORT_JOB] : Updating job manifest...
> [EXPORT_JOB] : Done. Updated in 0.00s
> [EXPORT_JOB] : Exported P1 J28 in 0.01s
> [EXPORT_JOB] : Request to export P1 J22
> [EXPORT_JOB] : Exporting job to /mnt/12T_HDD2/XX/P12/J22
> [EXPORT_JOB] : Exporting all of job's images in the database to /mnt/12T_HDD2/XXP12/J22/gridfs_data...
> [EXPORT_JOB] : Request to export P1 J22
> [EXPORT_JOB] : Exporting job to /mnt/12T_HDD2/XX/P12/J22
> [EXPORT_JOB] : Exporting all of job's images in the database to /mnt/12T_HDD2/XX/P12/J22/gridfs_data...
> [EXPORT_JOB] : Writing 167 database images to /mnt/12T_HDD2/XX/P12/J22/gridfs_data/gridfsdata_0
> [EXPORT_JOB] : Done. Exported 167 images in 0.39s
> [EXPORT_JOB] : Exporting all job's streamlog events...
> [EXPORT_JOB] : Done. Exported 1 files in 0.01s
> [EXPORT_JOB] : Exporting job metafile...
> [EXPORT_JOB] : Creating .csg file for particles
> [EXPORT_JOB] : Creating .csg file for volume
> [EXPORT_JOB] : Creating .csg file for mask
> [EXPORT_JOB] : Done. Exported in 0.02s
> [EXPORT_JOB] : Updating job manifest...
> [EXPORT_JOB] : Done. Updated in 0.00s
> [EXPORT_JOB] : Exported P1 J22 in 0.42s
> [EXPORT_JOB] : Request to export P1 J28
> [EXPORT_JOB] : Exporting job to /mnt/12T_HDD2/XX/P12/J28
> [EXPORT_JOB] : Exporting all of job's images in the database to /mnt/12T_HDD2/XX/P12/J28/gridfs_data...
> [EXPORT_JOB] : Done. Exported 0 images in 0.00s
> [EXPORT_JOB] : Exporting all job's streamlog events...
> [EXPORT_JOB] : Done. Exported 1 files in 0.00s
> [EXPORT_JOB] : Exporting job metafile...
> [EXPORT_JOB] : Done. Exported in 0.00s
> [EXPORT_JOB] : Updating job manifest...
> [EXPORT_JOB] : Done. Updated in 0.00s
> [EXPORT_JOB] : Exported P1 J28 in 0.01s
> ---------- Scheduler running ---------------
> Jobs Queued: [(u'P1', u'J28')]
> Licenses currently active : 0
> Now trying to schedule J28
> Need slots : {u'GPU': 1, u'RAM': 3, u'CPU': 4}
> Need fixed : {u'SSD': True}
> Master direct : False
> Running job directly on GPU id(s): [0] on Linuxbox
> Failed to connect link: HTTP Error 502: Bad Gateway
> Not a commercial instance - heartbeat set to 12 hours.
> Launchable! -- Launching.
> Changed job P1.J28 status launched
> ---------- Scheduler finished ---------------
> Changed job P1.J28 status started
> Changed job P1.J28 status running
> ---- Killing project UID P1 job UID J28
> Killing job on worker type node Linuxbox
> Killing job on worker on same node as master, not using ssh
> Changed job P1.J28 status killed
> [EXPORT_JOB] : Request to export P1 J28
> [EXPORT_JOB] : Exporting job to /mnt/12T_HDD2/XX/P12/J28
> [EXPORT_JOB] : Exporting all of job's images in the database to /mnt/12T_HDD2/XX/P12/J28/gridfs_data...
> [EXPORT_JOB] : Done. Exported 0 images in 0.00s
> [EXPORT_JOB] : Exporting all job's streamlog events...
> [EXPORT_JOB] : Done. Exported 1 files in 0.00s
> [EXPORT_JOB] : Exporting job metafile...
> [EXPORT_JOB] : Done. Exported in 0.01s
> [EXPORT_JOB] : Updating job manifest...
> [EXPORT_JOB] : Done. Updated in 0.00s
> [EXPORT_JOB] : Exported P1 J28 in 0.03s
> [EXPORT_JOB] : Request to export P1 J28
> [EXPORT_JOB] : Exporting job to /mnt/12T_HDD2/XX/P12/J28
> [EXPORT_JOB] : Exporting all of job's images in the database to /mnt/12T_HDD2/XX/P12/J28/gridfs_data...
> [EXPORT_JOB] : Done. Exported 0 images in 0.00s
> [EXPORT_JOB] : Exporting all job's streamlog events...
> [EXPORT_JOB] : Done. Exported 1 files in 0.00s
> [EXPORT_JOB] : Exporting job metafile...
> [EXPORT_JOB] : Done. Exported in 0.00s
> [EXPORT_JOB] : Updating job manifest...
> [EXPORT_JOB] : Done. Updated in 0.00s
> [EXPORT_JOB] : Exported P1 J28 in 0.01s
> ---------- Scheduler running ---------------
> Jobs Queued: [(u'P1', u'J28')]
> Licenses currently active : 0
> Now trying to schedule J28
> Need slots : {u'GPU': 1, u'RAM': 3, u'CPU': 4}
> Need fixed : {u'SSD': True}
> Master direct : False
> Running job directly on GPU id(s): [0] on Linuxbox
> Failed to connect link: HTTP Error 502: Bad Gateway
> Not a commercial instance - heartbeat set to 12 hours.
> Launchable! -- Launching.
> Changed job P1.J28 status launched
> Running project UID P1 job UID J28
> Running job on worker type node
> Running job using: /home/vamsee/software/cryosparc/cryosparc2_worker/bin/cryosparcw
> ---------- Scheduler finished ---------------
> Changed job P1.J28 status started
> Changed job P1.J28 status running
> Changed job P1.J28 status failed
> COMMAND CORE STARTED === 2020-12-04 16:57:59.125352 ==========================
> *** BG WORKER START
> [EXPORT_PROJECT] : Exporting project P1...
> [EXPORT_PROJECT] : Exported project P1 to /mnt/12T_HDD2/XX/P12/project.json in 0.03s
> [EXPORT_JOB] : Request to export P1 J28
> [EXPORT_JOB] : Exporting job to /mnt/12T_HDD2/XX/P12/J28
> [EXPORT_JOB] : Exporting all of job's images in the database to /mnt/12T_HDD2/XX/P12/J28/gridfs_data...
> [EXPORT_JOB] : Done. Exported 0 images in 0.00s
> [EXPORT_JOB] : Exporting all job's streamlog events...
> [EXPORT_JOB] : Done. Exported 1 files in 0.00s
> [EXPORT_JOB] : Exporting job metafile...
> [EXPORT_JOB] : Done. Exported in 0.00s
> [EXPORT_JOB] : Updating job manifest...
> [EXPORT_JOB] : Done. Updated in 0.21s
> [EXPORT_JOB] : Exported P1 J28 in 0.22s
> ---------- Scheduler running ---------------
> Jobs Queued: [(u'P1', u'J28')]
> Licenses currently active : 0
> Now trying to schedule J28
> Need slots : {u'GPU': 1, u'RAM': 3, u'CPU': 4}
> Need fixed : {u'SSD': True}
> Master direct : False
> Scheduling job to Linuxbox
> Failed to connect link: HTTP Error 502: Bad Gateway
> Not a commercial instance - heartbeat set to 12 hours.
> Launchable! -- Launching.
> Changed job P1.J28 status launched
> Running project UID P1 job UID J28
> Running job on worker type node
> Running job using: /home/vamsee/software/cryosparc/cryosparc2_worker/bin/cryosparcw
> ---------- Scheduler finished ---------------
> Changed job P1.J28 status started
> Changed job P1.J28 status running
> Changed job P1.J28 status failed
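Since the log above is long, the job status transitions can be pulled out with a small script. This is only a sketch in Python 3: the function names `job_status_history` and `failed_jobs` are hypothetical, and the only assumption about the log is the `Changed job <project>.<job> status <state>` wording visible above. `SAMPLE` is copied from the end of that log.

```python
import re

# Matches entries such as "Changed job P1.J28 status failed".
STATUS_RE = re.compile(r"Changed job (P\d+)\.(J\d+) status (\w+)")


def job_status_history(log_text):
    """Map 'P1.J28'-style ids to the ordered list of statuses seen in the log."""
    history = {}
    for project, job, status in STATUS_RE.findall(log_text):
        history.setdefault(project + "." + job, []).append(status)
    return history


def failed_jobs(log_text):
    """Return ids of jobs whose most recent status is 'failed'."""
    return [
        job_id
        for job_id, statuses in job_status_history(log_text).items()
        if statuses[-1] == "failed"
    ]


# Excerpt copied from the end of the command_core log above.
SAMPLE = (
    "Launchable! -- Launching. "
    "Changed job P1.J28 status launched "
    "Running project UID P1 job UID J28 "
    "Running job on worker type node "
    "---------- Scheduler finished --------------- "
    "Changed job P1.J28 status started "
    "Changed job P1.J28 status running "
    "Changed job P1.J28 status failed"
)

if __name__ == "__main__":
    for job_id, statuses in job_status_history(SAMPLE).items():
        print(job_id, " -> ".join(statuses))
```

Run on the full log, this shows J28 going through launched, started, running and then either killed (my manual kill) or failed, with no error message from command_core itself at the point of failure.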
Output from cryosparcm joblog P1 J28
> ================= CRYOSPARCW ======= 2020-12-04 17:01:57.033313 =========
> Project P1 Job J28
> Master Linuxbox Port 39002
> ===========================================================================
> ========= monitor process now starting main process
> MAINPROCESS PID 14716
> ========= monitor process now waiting for main process
> MAIN PID 14716
> nonuniform_refine.run cryosparc2_compute.jobs.jobregister
> /home/vamsee/software/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
> warnings.warn('creating CUBLAS context to get version number')
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ***************************************************************
> Running job J28 of type nonuniform_refine
> Running job on hostname %s Linuxbox
> Allocated Resources : {u'lane': u'default', u'target': {u'monitor_port': None, u'lane': u'default', u'name': u'Linuxbox', u'title': u'Worker node Linuxbox', u'resource_slots': {u'GPU': [0, 1], u'RAM': [0, 1, 2, 3], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, u'hostname': u'Linuxbox', u'worker_bin_path': u'/home/vamsee/software/cryosparc/cryosparc2_worker/bin/cryosparcw', u'cache_path': u'/mnt/NV_HDD', u'cache_quota_mb': None, u'resource_fixed': {u'SSD': True}, u'gpus': [{u'mem': 11554717696, u'id': 0, u'name': u'GeForce RTX 2080 Ti'}, {u'mem': 3168468992, u'id': 1, u'name': u'GeForce GTX 780 Ti'}], u'cache_reserve_mb': 10000, u'type': u'node', u'ssh_str': u'vamsee@Linuxbox', u'desc': None}, u'license': True, u'hostname': u'Linuxbox', u'slots': {u'GPU': [0], u'RAM': [0, 1, 2], u'CPU': [0, 1, 2, 3]}, u'fixed': {u'SSD': True}, u'lane_type': u'default', u'licenses_acquired': 1}
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> cryosparc2_compute/plotutil.py:244: RuntimeWarning: divide by zero encountered in log
> logabs = n.log(n.abs(fM))
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> FSC No-Mask... ========= sending heartbeat
> 0.143 at 17.523 radwn. 0.5 at 13.301 radwn. Took 9.005s.
> FSC Spherical Mask... ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> 0.143 at 21.571 radwn. 0.5 at 16.102 radwn. Took 13.670s.
> FSC Loose Mask... ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= main process now complete.
> ========= monitor process now complete.
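To narrow down where the job stops, the job log can be summarized with a short script. The following Python 3 sketch is an illustration under stated assumptions: the function names `assigned_gpu_names` and `fsc_progress` are hypothetical, the regular expressions only match the wording seen in the log above, and `SAMPLE` is an abridged copy of that log (the resource dictionary is shortened to the keys the script reads).

```python
import ast
import re

ALLOC_RE = re.compile(r"Allocated Resources : (\{.*\})")
FSC_START_RE = re.compile(r"FSC ([A-Za-z][A-Za-z\- ]*?)\.\.\.")
FSC_DONE_RE = re.compile(r"0\.143 at [\d.]+ radwn\. 0\.5 at [\d.]+ radwn\. Took [\d.]+s\.")


def assigned_gpu_names(joblog_text):
    """Return the names of the GPUs whose ids appear in the job's 'slots' entry."""
    match = ALLOC_RE.search(joblog_text)
    if match is None:
        return []
    # The worker prints a Python 2 repr; u'...' literals still parse in Python 3.
    resources = ast.literal_eval(match.group(1))
    names_by_id = {gpu["id"]: gpu["name"] for gpu in resources["target"]["gpus"]}
    return [names_by_id[gpu_id] for gpu_id in resources["slots"]["GPU"]]


def fsc_progress(joblog_text):
    """Pair each 'FSC <mask>...' marker with whether a result line was printed.

    Assumes results are printed in the same order as the stages start.
    """
    started = FSC_START_RE.findall(joblog_text)
    finished = len(FSC_DONE_RE.findall(joblog_text))
    return [(stage, index < finished) for index, stage in enumerate(started)]


# Abridged copy of the job log above.
SAMPLE = """\
> Allocated Resources : {u'target': {u'gpus': [{u'mem': 11554717696, u'id': 0, u'name': u'GeForce RTX 2080 Ti'}, {u'mem': 3168468992, u'id': 1, u'name': u'GeForce GTX 780 Ti'}]}, u'slots': {u'GPU': [0], u'RAM': [0, 1, 2], u'CPU': [0, 1, 2, 3]}}
> FSC No-Mask... ========= sending heartbeat
> 0.143 at 17.523 radwn. 0.5 at 13.301 radwn. Took 9.005s.
> FSC Spherical Mask... ========= sending heartbeat
> 0.143 at 21.571 radwn. 0.5 at 16.102 radwn. Took 13.670s.
> FSC Loose Mask... ========= sending heartbeat
> ========= main process now complete.
"""

if __name__ == "__main__":
    print("Assigned GPU(s):", assigned_gpu_names(SAMPLE))
    for stage, done in fsc_progress(SAMPLE):
        print(stage, "finished" if done else "no result printed")
```

For this log, the job was assigned GPU 0 (the 2080 Ti), the no-mask and spherical-mask FSCs completed, and the main process ended during the loose-mask FSC without printing a result or a traceback.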
Any idea what I could try next?