Only one preprocessing GPU used after other streaming jobs start

We are running cryoSPARC Live 3.2 on CentOS 7.7 with CUDA 10.1 and 4x GTX 1080 Ti cards in a single lane.

When we start a live session with, for example, 3 preprocessing GPU workers, the session begins by using all 3. However, after another job starts (e.g., streaming 2D classification), only 1 GPU is used for preprocessing. If we pause and restart the session, all 3 GPUs are again used for preprocessing until other jobs are started.

Is this a known issue? Is there a solution to keep the specified number of GPUs dedicated to preprocessing?

Michael

Hi @mpurdy,

Thanks for the description of the issue - the behaviour you describe is definitely not intended. How are you able to tell that only one GPU is being used for preprocessing after another job starts? Is it that the Live worker job is killed or fails?

It would be helpful if you could include a screenshot of your configuration tab (specifically the compute resources section on the left). Additionally, please navigate to the main cryoSPARC interface, open the workspace corresponding to this Live session and report the first few logs for each of the Live worker jobs (up until the initial images). Finally, also within the main cryoSPARC interface, if you could open the job that seems to cause this shift in allocation and copy the contents of the metadata tab (see screenshot below), that would be appreciated.

Thanks,
Suhail

Suhail, after further investigation, my description was incorrect.

When we start a live session with multiple preprocessing GPUs, we see from the blue-highlighted micrograph thumbnails that the specified number of preprocessing jobs are running. We can also see them running in the “active jobs” panel. However, once Live finishes preprocessing the queued movies, only one preprocessing GPU is used when new movies accumulate in the queue. We can see this in the active jobs and highlighted thumbnails. If we pause and restart, the specified number of preprocessing GPUs are used again.

Michael

Hi @mpurdy, thanks for the details. One more question: when you see this phenomenon happening (there are 3 workers “running” but only one is “actively” processing movies), does the queue of waiting movies grow continuously?
i.e., is the data collection rate actually faster than the processing rate of one worker?
There are several steps/timings involved in workers fetching work from the queue, so it is possible for the queue to have movies waiting (i.e., number queued > 0) but for that number to just float near zero, if one worker is sufficient to keep up with the collection rate.
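The queue behaviour described above can be sketched as a toy simulation (the per-step rates below are illustrative, not measured from cryoSPARC): when one worker's throughput matches the collection rate, the backlog hovers near zero; when collection outpaces that worker, the backlog grows without bound.

```python
# Toy model of the Live movie queue: movies arrive at a fixed rate per
# step and a single worker drains them. The rates are made up for
# illustration only.
def simulate_queue(steps, arrivals_per_step, processed_per_step):
    """Return the backlog size after each step."""
    queued = 0
    history = []
    for _ in range(steps):
        queued += arrivals_per_step
        queued = max(0, queued - processed_per_step)
        history.append(queued)
    return history

# One worker keeps up: the backlog never grows.
keeping_up = simulate_queue(steps=100, arrivals_per_step=1, processed_per_step=1)
# Collection outpaces one worker: the backlog grows every step.
falling_behind = simulate_queue(steps=100, arrivals_per_step=2, processed_per_step=1)
print(max(keeping_up), falling_behind[-1])  # backlog stays at 0 vs. grows to 100
```

This is why a "number queued > 0" reading on its own does not distinguish a healthy single worker from a stalled pool; whether the backlog trends upward over time is the telling signal.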

Once this happens, movies accumulate - that's the problem. That is, once Live has caught up with the incoming data, it stops using multiple preprocessors and subsequently falls behind (unless we restart).

Hi @mpurdy,
In that case, when you see this happening, can you copy and send us the streamlog (from the inspect modal of the job in the UI) of the workers that are still in “running” status but remain idle? The logs there should show some information about whether the workers are attempting to get new incoming movies but are not receiving any work, or are stuck in some other way.
Thanks!

Ali, yesterday we were running Live on a data collection with 3 preprocessing GPUs while also running streaming 2D classification (on a 4-GPU worker). After several hours of preprocessing keeping up with incoming data, I started an ab initio job. Two of the Live workers appear to have failed 20 minutes after the ab initio job started, and the other Live worker failed when the ab initio job finished. Here are the logs from the 3 Live workers:

```
############## J8
[CPU: 649.6 MB] PROCESSING EXPOSURE 675 ===========================================================

[CPU: 649.6 MB] Reading exposure /data4/K3/20210615_b2g/raw/FoilHole_15323789_Data_15308417_15308419_20210615_203156_fractions.tiff and initializing .cs file...

[CPU: 649.6 MB] Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/jobs/rtp_workers/run.py", line 370, in cryosparc_compute.jobs.rtp_workers.run.rtp_worker
  File "cryosparc_worker/cryosparc_compute/jobs/rtp_workers/run.py", line 431, in cryosparc_compute.jobs.rtp_workers.run.process_movie
  File "cryosparc_worker/cryosparc_compute/jobs/rtp_workers/run.py", line 464, in cryosparc_compute.jobs.rtp_workers.run.do_check
  File "cryosparc_worker/cryosparc_compute/jobs/rtp_workers/run.py", line 169, in cryosparc_compute.jobs.rtp_workers.run.RTPExposureCache.cache_read
  File "cryosparc_worker/cryosparc_compute/blobio/prefetch.py", line 45, in cryosparc_compute.blobio.prefetch.Prefetch.get
RuntimeError: TIFFReadDirectory605: Input/output error

[CPU: 649.6 MB] No new exposure received since 167 seconds ago. Searching again in 10 seconds...
```

```
################ J9
[CPU: 655.0 MB] PROCESSING EXPOSURE 1140 ===========================================================

[CPU: 655.0 MB] Reading exposure /data4/K3/20210615_b2g3/raw/FoilHole_15332933_Data_15308417_15308419_20210615_233841_fractions.tiff and initializing .cs file...

[CPU: 655.1 MB] Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/jobs/rtp_workers/run.py", line 370, in cryosparc_compute.jobs.rtp_workers.run.rtp_worker
  File "cryosparc_worker/cryosparc_compute/jobs/rtp_workers/run.py", line 431, in cryosparc_compute.jobs.rtp_workers.run.process_movie
  File "cryosparc_worker/cryosparc_compute/jobs/rtp_workers/run.py", line 464, in cryosparc_compute.jobs.rtp_workers.run.do_check
  File "cryosparc_worker/cryosparc_compute/jobs/rtp_workers/run.py", line 169, in cryosparc_compute.jobs.rtp_workers.run.RTPExposureCache.cache_read
  File "cryosparc_worker/cryosparc_compute/blobio/prefetch.py", line 45, in cryosparc_compute.blobio.prefetch.Prefetch.get
RuntimeError: TIFFOpen 540: Input/output error
```

```
########## J10
License is valid.

Launching job on lane default target xxxx ...

Job directory /data4/K3/20210615_b2g3/csparc2/P43/J10 is not empty, found: /data4/K3/20210615_Tan_b2g3/csparc2/P43/J10/.fuse_hidden0103e69300000138
```
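The I/O errors in J8/J9 and the leftover `.fuse_hidden` file in J10 can both point at an unstable FUSE or network mount rather than at cryoSPARC itself. As a hedged diagnostic sketch (paths below are illustrative, not taken from the logs), one can read each raw movie end-to-end so a filesystem error surfaces outside cryoSPARC, and scan for stale `.fuse_hidden` files:

```python
# Diagnostic sketch for a suspect data mount. Not part of cryoSPARC:
# it simply reads files fully and lists .fuse_hidden leftovers, which
# FUSE creates for files that were deleted while still open.
import pathlib

def can_read_fully(path):
    """Return True if the whole file can be read without an OSError."""
    try:
        with open(path, "rb") as f:
            while f.read(1 << 20):  # read in 1 MiB chunks
                pass
        return True
    except OSError as exc:
        print(f"{path}: {exc}")
        return False

def find_fuse_hidden(root):
    """List stale .fuse_hidden* files anywhere under root."""
    return sorted(pathlib.Path(root).rglob(".fuse_hidden*"))

# Illustrative paths; substitute your actual raw and project directories.
raw_dir = pathlib.Path("/data4/K3/20210615_b2g3/raw")
if raw_dir.is_dir():
    bad = [p for p in raw_dir.glob("*_fractions.tiff") if not can_read_fully(p)]
    print(f"{len(bad)} unreadable movie(s)")
    print(find_fuse_hidden("/data4/K3/20210615_b2g3/csparc2"))
```

If the same TIFFs read cleanly on retry, the errors were transient mount hiccups; a stale `.fuse_hidden` file under a job directory can usually be deleted once no process holds it open, which lets the job directory check pass.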