HTTP Error 502: Bad Gateway

vamsee · December 5, 2020, 1:34am

I previously had a workstation with 3x 2080Tis and Linux Mint 20 (Ulyana) installed on a 1TB SSD. I took the SSD out of that workstation and installed it in another workstation, which has 1x 2080Ti and 1x 780Ti (used only for display). The OS booted up fine and has been running almost all of the software I previously used. The exception is cryoSPARC. I have tried running both NU-R and Homogeneous refinements. Both fail while calculating FSCs. It seems the cryoSPARC instance is not able to connect to the GPU.

Troubleshooting steps I’ve tried

Uninstalled and reinstalled CUDA (10.2) without errors
Uninstalled and reinstalled cryosparc without errors
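
A further check that could be added to this list is confirming which GPUs the driver actually reports after the hardware swap. The following is a minimal sketch, not part of cryoSPARC: it assumes `nvidia-smi` is on the PATH, and the helper names `parse_gpu_csv` and `list_gpus` are hypothetical. Note that `nvidia-smi` lists devices in PCI bus order, which may differ from the order CUDA applications see by default.

```python
import csv
import io
import subprocess


def parse_gpu_csv(text):
    """Parse the CSV emitted by
    `nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader`."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or unexpected lines
        index, name, memory = (field.strip() for field in row)
        gpus.append({"index": int(index), "name": name, "memory": memory})
    return gpus


def list_gpus():
    """Run nvidia-smi and return the parsed GPU list (needs the NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling `list_gpus()` on the workstation and comparing the result with the `gpus` entry in the job's allocated resources would show whether GPU id 0 is the 2080Ti in both views.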

Output from cryosparcm log command_core

[IMPORT_PROJECT] :     Uploading project image data...
[IMPORT_PROJECT] :     Done. Uploaded 0 files in 0.00s
[IMPORT_PROJECT] :     Inserted job document in 0.00s...
[IMPORT_PROJECT] :     Inserting streamlogs into jobs...
[IMPORT_PROJECT] :     Done. Inserted 0 streamlogs in 0.00s...
[IMPORT_PROJECT] :   Imported J27 into P1 in 0.00s...
[IMPORT_PROJECT] :     Uploading project image data...
[IMPORT_PROJECT] :     Done. Uploaded 0 files in 0.00s
[IMPORT_PROJECT] :     Inserted job document in 0.01s...
[IMPORT_PROJECT] :     Inserting streamlogs into jobs...
[IMPORT_PROJECT] :     Done. Inserted 0 streamlogs in 0.00s...
[IMPORT_PROJECT] :   Imported J3 into P1 in 0.01s...
[IMPORT_PROJECT] :     Uploading project image data...
[IMPORT_PROJECT] :     Done. Uploaded 0 files in 0.00s
[IMPORT_PROJECT] :     Inserted job document in 0.00s...
[IMPORT_PROJECT] :     Inserting streamlogs into jobs...
[IMPORT_PROJECT] :     Done. Inserted 0 streamlogs in 0.00s...
[IMPORT_PROJECT] :   Imported J4 into P1 in 0.00s...
[IMPORT_PROJECT] :     Uploading project image data...
[IMPORT_PROJECT] :     Done. Uploaded 4 files in 0.03s
[IMPORT_PROJECT] :     Inserted job document in 0.04s...
[IMPORT_PROJECT] :     Inserting streamlogs into jobs...
[IMPORT_PROJECT] :     Done. Inserted 25 streamlogs in 0.03s...
[IMPORT_PROJECT] :   Imported J5 into P1 in 0.07s...
[IMPORT_PROJECT] :     Uploading project image data...
[IMPORT_PROJECT] :     Done. Uploaded 8 files in 0.04s
[IMPORT_PROJECT] :     Inserted job document in 0.05s...
[IMPORT_PROJECT] :     Inserting streamlogs into jobs...
[IMPORT_PROJECT] :     Done. Inserted 28 streamlogs in 0.02s...
[IMPORT_PROJECT] :   Imported J6 into P1 in 0.07s...
[IMPORT_PROJECT] :     Uploading project image data...
[IMPORT_PROJECT] :     Done. Uploaded 8 files in 0.04s
[IMPORT_PROJECT] :     Inserted job document in 0.05s...
[IMPORT_PROJECT] :     Inserting streamlogs into jobs...
[IMPORT_PROJECT] :     Done. Inserted 28 streamlogs in 0.02s...
[IMPORT_PROJECT] :   Imported J7 into P1 in 0.07s...
[IMPORT_PROJECT] :     Uploading project image data...
[IMPORT_PROJECT] :     Done. Uploaded 8 files in 0.03s
[IMPORT_PROJECT] :     Inserted job document in 0.05s...
[IMPORT_PROJECT] :     Inserting streamlogs into jobs...
[IMPORT_PROJECT] :     Done. Inserted 28 streamlogs in 0.03s...
[IMPORT_PROJECT] :   Imported J8 into P1 in 0.07s...
[IMPORT_PROJECT] :     Uploading project image data...
[IMPORT_PROJECT] :     Done. Uploaded 8 files in 0.05s
[IMPORT_PROJECT] :     Inserted job document in 0.06s...
[IMPORT_PROJECT] :     Inserting streamlogs into jobs...
[IMPORT_PROJECT] :     Done. Inserted 28 streamlogs in 0.01s...
[IMPORT_PROJECT] :   Imported J9 into P1 in 0.07s...
[IMPORT_PROJECT] : Imported project from /mnt/12T_HDD2/XX/P12 as P1 in 4.03s
[EXPORT_PROJECT] : Exporting project P1...
[EXPORT_PROJECT] : Exported project P1 to /mnt/12T_HDD2/XX/P12/project.json in 0.00s
---- Deleting project UID P1 job UID J27 
     Now clearing job..
[EXPORT_JOB] : Request to export P1 J27
[EXPORT_JOB] :    Exporting job to /mnt/12T_HDD2/XX/P12/J27
[EXPORT_JOB] :    Exporting all of job's images in the database to /mnt/12T_HDD2/XX/P12/J27/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.00s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.01s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P1 J27 in 0.03s
[EXPORT_PROJECT] : Exporting project P1...
[EXPORT_PROJECT] : Exported project P1 to /mnt/12T_HDD2/XX/P12/project.json in 0.00s
[EXPORT_JOB] : Request to export P1 J28
[EXPORT_JOB] :    Exporting job to /mnt/12T_HDD2/XX/P12/J28
[EXPORT_JOB] :    Exporting all of job's images in the database to /mnt/12T_HDD2/XX/P12/J28/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.00s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.00s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P1 J28 in 0.01s
[EXPORT_JOB] : Request to export P1 J22
[EXPORT_JOB] :    Exporting job to /mnt/12T_HDD2/XX/P12/J22
[EXPORT_JOB] :    Exporting all of job's images in the database to /mnt/12T_HDD2/XXP12/J22/gridfs_data...
[EXPORT_JOB] : Request to export P1 J22
[EXPORT_JOB] :    Exporting job to /mnt/12T_HDD2/XX/P12/J22
[EXPORT_JOB] :    Exporting all of job's images in the database to /mnt/12T_HDD2/XX/P12/J22/gridfs_data...
[EXPORT_JOB] :    Writing 167 database images to /mnt/12T_HDD2/XX/P12/J22/gridfs_data/gridfsdata_0
[EXPORT_JOB] :    Done. Exported 167 images in 0.39s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.01s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Creating .csg file for particles
[EXPORT_JOB] :    Creating .csg file for volume
[EXPORT_JOB] :    Creating .csg file for mask
[EXPORT_JOB] :    Done. Exported in 0.02s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P1 J22 in 0.42s
[EXPORT_JOB] : Request to export P1 J28
[EXPORT_JOB] :    Exporting job to /mnt/12T_HDD2/XX/P12/J28
[EXPORT_JOB] :    Exporting all of job's images in the database to /mnt/12T_HDD2/XX/P12/J28/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.00s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.00s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P1 J28 in 0.01s
---------- Scheduler running --------------- 
Jobs Queued:  [(u'P1', u'J28')]
Licenses currently active : 0
Now trying to schedule J28
  Need slots :  {u'GPU': 1, u'RAM': 3, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
   Running job directly on GPU id(s): [0] on Linuxbox
Failed to connect link: HTTP Error 502: Bad Gateway
Not a commercial instance - heartbeat set to 12 hours.
     Launchable! -- Launching.
Changed job P1.J28 status launched
---------- Scheduler finished --------------- 
Changed job P1.J28 status started
Changed job P1.J28 status running
---- Killing project UID P1 job UID J28 
     Killing job on worker type node Linuxbox
     Killing job on worker on same node as master, not using ssh
Changed job P1.J28 status killed
[EXPORT_JOB] : Request to export P1 J28
[EXPORT_JOB] :    Exporting job to /mnt/12T_HDD2/XX/P12/J28
[EXPORT_JOB] :    Exporting all of job's images in the database to /mnt/12T_HDD2/XX/P12/J28/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.00s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.01s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P1 J28 in 0.03s
[EXPORT_JOB] : Request to export P1 J28
[EXPORT_JOB] :    Exporting job to /mnt/12T_HDD2/XX/P12/J28
[EXPORT_JOB] :    Exporting all of job's images in the database to /mnt/12T_HDD2/XX/P12/J28/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.00s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.00s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P1 J28 in 0.01s
---------- Scheduler running --------------- 
Jobs Queued:  [(u'P1', u'J28')]
Licenses currently active : 0
Now trying to schedule J28
  Need slots :  {u'GPU': 1, u'RAM': 3, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
   Running job directly on GPU id(s): [0] on Linuxbox
Failed to connect link: HTTP Error 502: Bad Gateway
Not a commercial instance - heartbeat set to 12 hours.
     Launchable! -- Launching.
Changed job P1.J28 status launched
      Running project UID P1 job UID J28 
        Running job on worker type node
        Running job using:  /home/vamsee/software/cryosparc/cryosparc2_worker/bin/cryosparcw
---------- Scheduler finished --------------- 
Changed job P1.J28 status started
Changed job P1.J28 status running
Changed job P1.J28 status failed
COMMAND CORE STARTED ===  2020-12-04 16:57:59.125352  ==========================
*** BG WORKER START
[EXPORT_PROJECT] : Exporting project P1...
[EXPORT_PROJECT] : Exported project P1 to /mnt/12T_HDD2/XX/P12/project.json in 0.03s
[EXPORT_JOB] : Request to export P1 J28
[EXPORT_JOB] :    Exporting job to /mnt/12T_HDD2/XX/P12/J28
[EXPORT_JOB] :    Exporting all of job's images in the database to /mnt/12T_HDD2/XX/P12/J28/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.00s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.00s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.21s
[EXPORT_JOB] : Exported P1 J28 in 0.22s
---------- Scheduler running --------------- 
Jobs Queued:  [(u'P1', u'J28')]
Licenses currently active : 0
Now trying to schedule J28
  Need slots :  {u'GPU': 1, u'RAM': 3, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
   Scheduling job to Linuxbox
Failed to connect link: HTTP Error 502: Bad Gateway
Not a commercial instance - heartbeat set to 12 hours.
     Launchable! -- Launching.
Changed job P1.J28 status launched
      Running project UID P1 job UID J28 
        Running job on worker type node
        Running job using:  /home/vamsee/software/cryosparc/cryosparc2_worker/bin/cryosparcw
---------- Scheduler finished --------------- 
Changed job P1.J28 status started
Changed job P1.J28 status running
Changed job P1.J28 status failed

Output from cryosparcm joblog P1 J28

>     ================= CRYOSPARCW =======  2020-12-04 17:01:57.033313  =========
> Project P1 Job J28
> Master Linuxbox Port 39002
> ===========================================================================
> ========= monitor process now starting main process
> MAINPROCESS PID 14716
> ========= monitor process now waiting for main process
> MAIN PID 14716
> nonuniform_refine.run cryosparc2_compute.jobs.jobregister
> /home/vamsee/software/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
>   warnings.warn('creating CUBLAS context to get version number')
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ***************************************************************
> Running job  J28  of type  nonuniform_refine
> Running job on hostname %s Linuxbox
> Allocated Resources :  {u'lane': u'default', u'target': {u'monitor_port': None, u'lane': u'default', u'name': u'Linuxbox', u'title': u'Worker node Linuxbox', u'resource_slots': {u'GPU': [0, 1], u'RAM': [0, 1, 2, 3], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, u'hostname': u'Linuxbox', u'worker_bin_path': u'/home/vamsee/software/cryosparc/cryosparc2_worker/bin/cryosparcw', u'cache_path': u'/mnt/NV_HDD', u'cache_quota_mb': None, u'resource_fixed': {u'SSD': True}, u'gpus': [{u'mem': 11554717696, u'id': 0, u'name': u'GeForce RTX 2080 Ti'}, {u'mem': 3168468992, u'id': 1, u'name': u'GeForce GTX 780 Ti'}], u'cache_reserve_mb': 10000, u'type': u'node', u'ssh_str': u'vamsee@Linuxbox', u'desc': None}, u'license': True, u'hostname': u'Linuxbox', u'slots': {u'GPU': [0], u'RAM': [0, 1, 2], u'CPU': [0, 1, 2, 3]}, u'fixed': {u'SSD': True}, u'lane_type': u'default', u'licenses_acquired': 1}
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> cryosparc2_compute/plotutil.py:244: RuntimeWarning: divide by zero encountered in log
>   logabs = n.log(n.abs(fM))
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> FSC No-Mask...        ========= sending heartbeat
>  0.143 at 17.523 radwn. 0.5 at 13.301 radwn. Took 9.005s.
> FSC Spherical Mask... ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
>  0.143 at 21.571 radwn. 0.5 at 16.102 radwn. Took 13.670s.
> FSC Loose Mask...     ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= sending heartbeat
> ========= main process now complete.
> ========= monitor process now complete.

Any idea what I could try next?

apunjani · December 16, 2020, 3:31pm

Hi @vamsee,

Is there an error message (traceback) in the streamlog of the job?
Also, how much CPU RAM did your old machine have, and how much does the new one have?
Also what is the box size of particles you are processing?

vamsee · December 16, 2020, 4:48pm

@apunjani That was another thing I was worried about. My previous machine had 256GB of RAM. The new one has only 32GB (planning to upgrade soon). There are several extraction jobs in the projects at different box sizes ranging from 128 to 512.
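
To put the RAM question in numbers, here is a minimal sketch under stated assumptions: it reads `/proc/meminfo` (Linux only) and estimates the size of a single cubic volume stored as 32-bit floats. How many such arrays a refinement holds in memory at once is not known here, so this is only a lower bound per volume; the function names are hypothetical.

```python
def parse_meminfo(text):
    """Return /proc/meminfo fields as a dict of integers (values in kB)."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the kernel's memory summary."""
    with open(path) as handle:
        return parse_meminfo(handle.read())


def volume_gib(box_size, bytes_per_voxel=4):
    """Size in GiB of one box_size^3 volume (assumes float32 voxels by default)."""
    return box_size ** 3 * bytes_per_voxel / 2 ** 30
```

For example, `volume_gib(512)` gives 0.5 GiB for one volume at the largest box size mentioned, and `read_meminfo()["MemAvailable"]` reports the currently available RAM in kB for comparison.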