Only one preprocessing GPU used after other streaming jobs start

We are running cryoSPARC Live v3.2 on CentOS 7.7 with CUDA 10.1 and 4x GTX 1080 Ti cards in a single lane.

When we start a Live session with, for example, 3 preprocessing GPU workers, the session begins by using all 3. However, after another job starts (e.g., streaming 2D classification), only 1 GPU is used for preprocessing. If we pause and restart the session, 3 GPUs are again used for preprocessing initially, until other jobs are started.

Is this a known issue? Is there a solution to keep the specified number of GPUs dedicated to preprocessing?

Michael

Hi @mpurdy,

Thanks for the description of the issue - this behaviour is definitely not intended. How are you able to tell that only one GPU is used for preprocessing after another job starts? Is it that a Live worker job is killed or fails?

It would be helpful if you could include a screenshot of your configuration tab (specifically the compute resources section on the left). Additionally, please navigate to the main cryoSPARC interface, open the workspace corresponding to this Live session, and report the first few log lines for each of the Live worker jobs (up until the initial images). Finally, also within the main cryoSPARC interface, it would be appreciated if you could open the job that seems to cause this shift in allocation and copy the contents of its metadata tab (see screenshot below).

[screenshot: metadata tab in the job view]

Thanks,
Suhail

Suhail, after further investigation, my description was incorrect.

When we start a Live session with multiple preprocessing GPUs, we can see from the blue-highlighted micrograph thumbnails that the specified number of preprocessing jobs are running, and they also appear in the “active jobs”. However, once Live finishes preprocessing the queued movies, only one preprocessing GPU is used when new movies subsequently accumulate in the queue. We can see this in the active jobs and the highlighted thumbnails. If we pause and restart, the specified number of preprocessing GPUs are used again.

Michael

Hi @mpurdy, thanks for the details. One more question: when you see this phenomenon (3 workers “running” but only one “actively” processing movies), does the queue of waiting movies accumulate continuously?
That is, is the data collection rate actually faster than the processing rate of a single worker?
There are several steps and timings involved in workers fetching work from the queue, so it is possible for the queue to have movies waiting (i.e. number queued > 0) while that number just floats near zero, if one worker is sufficient to keep up with the collection rate.
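
To illustrate what I mean by those timings, here is a minimal, purely illustrative Python sketch (not cryoSPARC's actual code) of a worker polling a shared exposure queue: if one worker drains movies as fast as they arrive, the queued count hovers near zero and the other “running” workers keep waking up to find nothing to do.

import queue
import time

# Purely illustrative, not cryoSPARC's actual mechanism: a stand-in for the
# shared exposure queue and one worker's fetch loop, showing why "number
# queued" can float near zero while extra workers stay idle but keep polling.
exposure_queue = queue.Queue()

def process(movie):
    """Hypothetical per-movie preprocessing step."""
    ...

def worker_loop(worker_id, poll_interval=10):
    while True:
        try:
            movie = exposure_queue.get_nowait()
        except queue.Empty:
            # Nothing waiting right now: sleep and search again, like the
            # "Searching again in 10 seconds..." message in the stream log.
            time.sleep(poll_interval)
            continue
        process(movie)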

Yes, once this happens movies do accumulate - that’s the problem. That is, once Live has caught up with the incoming data, it drops to a single preprocessor and then falls behind (unless we restart).

Hi @mpurdy,
In that case, when you see this happening, can you copy and send us the streamlog (from the inspect modal of the job in the UI) of the workers that are still in “running” status but remain idle? The logs there should show some information about whether the workers are attempting to get new incoming movies but are not receiving any work, or are stuck in some other way.
Thanks!

Ali, yesterday we were running Live on a data collection with 3 preprocessing GPUs plus streaming 2D classification (on a 4-GPU worker). After several hours of preprocessing keeping up with the incoming data, I started an ab initio job. Two of the Live workers appear to have failed 20 minutes after the ab initio job started, and the third Live worker failed when the ab initio job finished. Here are the logs from the 3 Live workers:

############## J8
[CPU: 649.6 MB] PROCESSING EXPOSURE 675 ===========================================================

[CPU: 649.6 MB] Reading exposure /data4/K3/20210615_b2g/raw/FoilHole_15323789_Data_15308417_15308419_20210615_203156_fractions.tiff and initializing .cs file…

[CPU: 649.6 MB] Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/jobs/rtp_workers/run.py", line 370, in cryosparc_compute.jobs.rtp_workers.run.rtp_worker
  File "cryosparc_worker/cryosparc_compute/jobs/rtp_workers/run.py", line 431, in cryosparc_compute.jobs.rtp_workers.run.process_movie
  File "cryosparc_worker/cryosparc_compute/jobs/rtp_workers/run.py", line 464, in cryosparc_compute.jobs.rtp_workers.run.do_check
  File "cryosparc_worker/cryosparc_compute/jobs/rtp_workers/run.py", line 169, in cryosparc_compute.jobs.rtp_workers.run.RTPExposureCache.cache_read
  File "cryosparc_worker/cryosparc_compute/blobio/prefetch.py", line 45, in cryosparc_compute.blobio.prefetch.Prefetch.get
RuntimeError: TIFFReadDirectory605: Input/output error

[CPU: 649.6 MB] No new exposure received since 167 seconds ago. Searching again in 10 seconds…

################ J9
[CPU: 655.0 MB] PROCESSING EXPOSURE 1140 ===========================================================

[CPU: 655.0 MB] Reading exposure /data4/K3/20210615_b2g3/raw/FoilHole_15332933_Data_15308417_15308419_20210615_233841_fractions.tiff and initializing .cs file…

[CPU: 655.1 MB] Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/jobs/rtp_workers/run.py", line 370, in cryosparc_compute.jobs.rtp_workers.run.rtp_worker
  File "cryosparc_worker/cryosparc_compute/jobs/rtp_workers/run.py", line 431, in cryosparc_compute.jobs.rtp_workers.run.process_movie
  File "cryosparc_worker/cryosparc_compute/jobs/rtp_workers/run.py", line 464, in cryosparc_compute.jobs.rtp_workers.run.do_check
  File "cryosparc_worker/cryosparc_compute/jobs/rtp_workers/run.py", line 169, in cryosparc_compute.jobs.rtp_workers.run.RTPExposureCache.cache_read
  File "cryosparc_worker/cryosparc_compute/blobio/prefetch.py", line 45, in cryosparc_compute.blobio.prefetch.Prefetch.get
RuntimeError: TIFFOpen 540: Input/output error

########## J10
License is valid.

Launching job on lane default target xxxx …

Job directory /data4/K3/20210615_b2g3/csparc2/P43/J10 is not empty, found: /data4/K3/20210615_Tan_b2g3/csparc2/P43/J10/.fuse_hidden0103e69300000138

Here is the current session in which 1 of 2 preprocessing workers stopped after Live caught up with the incoming data:

Here is the log from the worker that failed:
Job directory /data4/K3/20210615_b2g3/csparc2/P43/J20 is not empty, found: /data4/K3/20210615_Tan_b2g3/csparc2/P43/J20/.fuse_hidden010637bc00000145
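
(As an aside, a throwaway Python snippet along the following lines, with an illustrative path, can check whether any .fuse_hidden leftovers like the ones in the J10/J20 messages are still sitting under the project directory; those are what make the job directories “not empty”.)

from pathlib import Path

# Throwaway check, not a cryoSPARC tool: list any .fuse_hidden* leftovers under
# the project directory (the files that make the job directories "not empty").
project_dir = Path("/data4/K3/20210615_b2g3/csparc2/P43")  # adjust to your project

for leftover in sorted(project_dir.rglob(".fuse_hidden*")):
    print(leftover, leftover.stat().st_size, "bytes")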

Hi @mpurdy,

Interesting… Is /data4 a network-attached filesystem? If so, do you have any details about how it’s configured (e.g. what filesystem is it, what’s the connection quality between the processing computer and the file server, etc.)?

Thanks,
Harris

Harris, yes, the storage is ZFS, connected to the processing computer over the network using sshfs (I guess that is the source of the unexpected .fuse_hidden file). Here are some stats on the connection:

[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.09 GBytes 937 Mbits/sec

1000 packets transmitted, 1000 received, 0% packet loss, time 10019ms
rtt min/avg/max/mdev = 0.044/0.633/1.781/0.466 ms
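
(The raw throughput and round-trip times look healthy, but that does not rule out intermittent failures at the sshfs layer itself. A quick, purely illustrative check is to repeatedly re-read one of the movies that failed through the mounted path and watch for timing spikes or I/O errors; the path below is just one of the files from the tracebacks above.)

import time

# Illustrative check of the mount itself, not a cryoSPARC tool: re-read one of
# the movies that previously failed and report timings or I/O errors.
path = "/data4/K3/20210615_b2g/raw/FoilHole_15323789_Data_15308417_15308419_20210615_203156_fractions.tiff"

for i in range(20):
    start = time.time()
    try:
        with open(path, "rb") as f:
            nbytes = len(f.read())
        print(f"read {i}: {nbytes} bytes in {time.time() - start:.1f} s")
    except OSError as exc:
        # An intermittent EIO here would match the TIFFOpen/TIFFReadDirectory
        # "Input/output error" seen in the worker tracebacks.
        print(f"read {i}: failed with {exc}")
    time.sleep(5)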

Thanks

Hi @mpurdy,

We’ve had several reports of obscure problems using sshfs, unfortunately. If you’re able to switch over to using NFS instead, that will likely resolve these issues.

Sorry I can’t be of more help.
–Harris

Harris, o.k., I established a direct connection between the data server and workstation and mounted the filesystems with NFS. Hopefully this will solve the problem. (And the connection is better.)

Thanks to all of you for the help sorting this out.

Michael


Well, we have the same issue with the NFS mount, i.e., after ~1500 movies 1 of 2 preprocessing workers stopped. Here is the output from that worker:

Overview:
[CPU: 670.6 MB] PROCESSING EXPOSURE 1440 ===========================================================

[CPU: 670.6 MB] Reading exposure /data1/K3/20210622_xxx/raw/FoilHole_22406818_Data_22386506_22386508_20210622_205602_fractions.tiff and initializing .cs file…

Metadata:

{
   "_id":{
      "_str":"60d203035e783e01d5da882f"
   },
   "created_at":"2021-06-22T15:34:27.859Z",
   "deleted":false,
   "project_uid":"P46",
   "status":"running",
   "type":"rtp_worker",
   "uid":"J2",
   "workspace_uids":[
      "W1"
   ],
   "children":[
      
   ],
   "cloned_from":null,
   "parents":[
      "J1"
   ],
   "queue_message":null,
   "title":"New Job J2",
   "ui_tile_height":1,
   "ui_tile_images":[
      {
         "name":"mic0",
         "fileid":"60d287ce04683350ff20ef69",
         "num_rows":1,
         "num_cols":1
      }
   ],
   "ui_tile_width":1,
   "job_type":"rtp_worker",
   "completed_at":null,
   "created_by_job_uid":null,
   "created_by_user_id":"607f14dea30bf3defa66c693",
   "description":"Enter a description.",
   "failed_at":"2021-06-23T00:50:09.394Z",
   "interactive":false,
   "job_dir_size":0,
   "killed_at":null,
   "last_accessed":{
      "name":"Purdy",
      "accessed_at":"2021-06-23T13:21:37.567Z"
   },
   "launched_at":"2021-06-23T00:50:12.426Z",
   "priority":0,
   "queued_at":"2021-06-23T00:50:12.017Z",
   "started_at":"2021-06-23T00:50:14.136Z",
   "waiting_at":null,
   "version":"v3.2.0",
   "run_as_user":null,
   "params_secs":{
      "compute_settings":{
         "title":"Compute settings",
         "desc":"",
         "order":0,
         "name":"compute_settings"
      }
   },
   "params_base":{
      "session_uid":{
         "type":"string",
         "value":"S1",
         "title":"cryoSPARC Live Session ID",
         "desc":"",
         "order":0,
         "section":"compute_settings",
         "advanced":false,
         "hidden":false,
         "name":"session_uid",
         "is_default":false
      },
      "lane_name":{
         "type":"string",
         "value":"default",
         "title":"cryoSPARC Live Lane Name",
         "desc":"",
         "order":1,
         "section":"compute_settings",
         "advanced":false,
         "hidden":false,
         "name":"lane_name",
         "is_default":false
      },
      "interval":{
         "type":"number",
         "value":10,
         "title":"cryoSPARC Live Exposure Search Interval",
         "desc":"",
         "order":2,
         "section":"compute_settings",
         "advanced":false,
         "hidden":false,
         "name":"interval",
         "is_default":true
      }
   },
   "params_spec":{
      "session_uid":{
         "value":"S1"
      },
      "lane_name":{
         "value":"default"
      }
   },
   "input_slot_groups":[
      {
         "type":"live",
         "name":"live",
         "title":"Live Session",
         "description":"",
         "count_min":0,
         "count_max":1,
         "repeat_allowed":false,
         "slots":[
            {
               "type":"live.session_info",
               "name":"session_info",
               "title":"Session Info",
               "description":"",
               "optional":false
            }
         ],
         "connections":[
            {
               "job_uid":"J1",
               "group_name":"live",
               "slots":[
                  {
                     "slot_name":"session_info",
                     "job_uid":"J1",
                     "group_name":"live",
                     "result_name":"session_info",
                     "result_type":"live.session_info",
                     "version":"F"
                  }
               ]
            }
         ]
      }
   ],
   "output_result_groups":[
      
   ],
   "output_results":[
      
   ],
   "output_group_images":{
      "particles":"60d287ce04683350ff20ef6b"
   },
   "errors_build_params":{
      
   },
   "errors_build_inputs":{
      
   },
   "errors_run":[
      
   ],
   "running_at":"2021-06-23T00:50:25.811Z",
   "token_acquired_at":null,
   "tokens_requested_at":null,
   "last_scheduled_at":null,
   "resources_needed":{
      "slots":{
         "CPU":6,
         "GPU":1,
         "RAM":2
      },
      "fixed":{
         "SSD":false
      }
   },
   "resources_allocated":{
      "lane":"default",
      "lane_type":"default",
      "hostname":"xxx",
      "target":{
         "type":"node",
         "lane":"default",
         "name":"xxx",
         "title":"Worker node xxx",
         "desc":null,
         "hostname":"xxx",
         "ssh_str":"xxx",
         "worker_bin_path":"/ws/local/progs/csparc/cryosparc_worker/bin/cryosparcw",
         "resource_slots":{
            "CPU":[
               0,
               1,
               2,
               3,
               4,
               5,
               6,
               7,
               8,
               9,
               10,
               11,
               12,
               13,
               14,
               15,
               16,
               17,
               18,
               19,
               20,
               21,
               22,
               23,
               24,
               25,
               26,
               27,
               28,
               29,
               30,
               31
            ],
            "GPU":[
               0,
               1,
               2,
               3
            ],
            "RAM":[
               0,
               1,
               2,
               3,
               4,
               5,
               6,
               7,
               8,
               9,
               10,
               11,
               12,
               13,
               14,
               15
            ]
         },
         "resource_fixed":{
            "SSD":true
         },
         "cache_path":"/mnt/ssd970/",
         "cache_reserve_mb":10000,
         "cache_quota_mb":null,
         "monitor_port":null,
         "gpus":[
            {
               "id":0,
               "name":"GeForce GTX 1080 Ti",
               "mem":11721113600
            },
            {
               "id":1,
               "name":"GeForce GTX 1080 Ti",
               "mem":11721506816
            },
            {
               "id":2,
               "name":"GeForce GTX 1080 Ti",
               "mem":11721506816
            },
            {
               "id":3,
               "name":"GeForce GTX 1080 Ti",
               "mem":11721506816
            }
         ]
      },
      "slots":{
         "CPU":[
            0,
            1,
            2,
            3,
            4,
            5
         ],
         "GPU":[
            0
         ],
         "RAM":[
            0,
            1
         ]
      },
      "fixed":{
         "SSD":false
      },
      "license":true,
      "licenses_acquired":1
   },
   "run_on_master_direct":false,
   "queued_to_lane":"default",
   "queue_index":null,
   "queue_status":null,
   "queued_job_hash":null,
   "interactive_hostname":"xxx",
   "interactive_port":null,
   "PID_monitor":16553,
   "PID_main":16554,
   "PID_workers":[

   ],
   "cluster_job_id":null,
   "is_experiment":false,
   "job_dir":"J2",
   "experiment_worker_path":null,
   "enable_bench":false,
   "bench":{

   },
   "instance_information":{
      "platform_node":"xxx",
      "platform_release":"3.10.0-1062.el7.x86_64",
      "platform_version":"#1 SMP Wed Aug 7 18:08:02 UTC 2019",
      "platform_architecture":"x86_64",
      "physical_cores":16,
      "max_cpu_freq":2100,
      "total_memory":"125.63GB",
      "available_memory":"83.47GB",
      "used_memory":"40.04GB",
      "gpu_info":[
         {
            "id":0,
            "name":"GeForce GTX 1080 Ti",
            "mem":11721113600
         }
      ],
      "CUDA_version":"10.1.0"
   },
   "project_uid_num":46,
   "uid_num":2,
   "ui_layouts":{
      "P46":{
         "show":true,
         "floater":false,
         "top":232,
         "left":1350,
         "width":152,
         "height":192,
         "groups":[

         ]
      },
      "P46W1":{
         "show":true,
         "floater":false,
         "top":232,
         "left":1350,
         "width":152,
         "height":192,
         "groups":[

         ]
      }
   },
   "last_exported":"2021-06-23T00:50:11.992Z",
   "queued_to_hostname":false,
   "queued_to_gpu":false,
   "no_check_inputs_ready":false,
   "num_tokens":1,
   "tokens_acquired_at":1624409412.4197166,
   "status_num":25,
   "progress":[]
}

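(For reference, the metadata above includes the scheduler's allocation; a small, purely illustrative Python snippet like the following, with hypothetical file names, can load each rtp_worker's metadata JSON and print its status and allocated GPU slots, to confirm whether the workers still hold separate GPUs.)

import json

# Illustrative only: given metadata dumps saved from the UI (one JSON file per
# rtp_worker job; the file names here are hypothetical), print each worker's
# status and which GPU slots it was allocated.
for fname in ["J2_metadata.json", "J3_metadata.json"]:
    with open(fname) as f:
        meta = json.load(f)
    gpus = meta["resources_allocated"]["slots"]["GPU"]
    print(meta["uid"], "status:", meta["status"], "GPU slots:", gpus)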

Hmmm… is it failing with the same “TIFFReadDirectory” error as before? If so, could you please post the output of cryosparcm joblog Pxx Jyy | tail -50 (where xx and yy are replaced with the project and job numbers)?

Thanks very much.
Harris

No, there is no error message - just the output I sent: “Reading exposure … and initializing .cs file…”.

cryosparcm joblog shows only “sending heartbeat” for the worker that is stalled and the one that is running.

Hi @mpurdy,

That being the case, I’m sorry to say we won’t be able to debug this remotely, and unfortunately we haven’t been able to reproduce this issue ourselves. However, I should mention that many users who have hard-to-explain problems on CentOS 7 have found that the issues disappear when they move to a more modern operating system. We use Ubuntu 20.X internally, and all of our pre-release testing takes place on that platform. I would recommend that if possible, you try upgrading the OS.

–Harris

O.k., good to know. We have an Ubuntu 20.04 workstation we can move our Live processing onto while I migrate the one in question.