CryoSPARC 3.0.1 jobs use all available GPUs

Dear all,

After upgrading to version 3.0.1, I noticed that all launched jobs started using all available GPUs. Previously, each job used its designated GPU and the queuing worked well. Now jobs are getting really slow, and I wonder whether this behavior is the reason. Is this the expected method of job distribution?
As an example, I launched one job on GPU 0, and this is what I see in nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:41:00.0 Off |                    0 |
| N/A   53C    P0    29W /  70W |    166MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:61:00.0 Off |                    0 |
| N/A   51C    P0    33W /  70W |    640MiB / 15109MiB |     16%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            Off  | 00000000:81:00.0 Off |                    0 |
| N/A   47C    P8    10W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            Off  | 00000000:C1:00.0 Off |                    0 |
| N/A   53C    P8    10W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     61783      C   python                            163MiB |
|    1   N/A  N/A     61783      C   python                            647MiB |
|    2   N/A  N/A     61783      C   python                              0MiB |
|    3   N/A  N/A     61783      C   python                              0MiB |
+-----------------------------------------------------------------------------+
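
For reference, the short Python sketch below is just a quick check I can run on the worker, not anything that ships with CryoSPARC, and it assumes the pynvml (nvidia-ml-py) bindings are installed. It lists every GPU on which a given job PID, here the 61783 from the process table above, currently holds a CUDA context:

import pynvml

TARGET_PID = 61783  # the single job's PID reported by nvidia-smi above

pynvml.nvmlInit()
try:
    for idx in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
        # One entry is returned per compute process holding a CUDA context
        # on this GPU, with its PID and (if supported) memory usage.
        for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
            if proc.pid == TARGET_PID:
                mem_mib = (proc.usedGpuMemory or 0) // (1024 * 1024)
                print(f"GPU {idx}: PID {proc.pid} holds a context ({mem_mib} MiB)")
finally:
    pynvml.nvmlShutdown()

If what I describe above is real, this prints a line for all four GPUs even though the job was queued on GPU 0 only, matching the nvidia-smi listing.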

I would greatly appreciate your help!

Best regards,
Gabor

Hi @gabor, this is really strange. Can you tell us your OS version? We have not seen this behaviour before. What job type did you launch?

Dear Ali,

The OS is Ubuntu 18.04.4, with 4 NVIDIA T4 cards. Right now, patch motion correction (multi), non-uniform refinement, and homo_refinement (new) jobs are running.
This is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:41:00.0 Off |                    0 |
| N/A   61C    P0    31W /  70W |    552MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:61:00.0 Off |                    0 |
| N/A   59C    P0    30W /  70W |    218MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            Off  | 00000000:81:00.0 Off |                    0 |
| N/A   60C    P0    30W /  70W |    166MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            Off  | 00000000:C1:00.0 Off |                    0 |
| N/A   40C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     63296      C   python                            163MiB |
|    0   N/A  N/A     65070      C   python                            221MiB |
|    0   N/A  N/A     67892      C   python                            163MiB |
|    1   N/A  N/A     63296      C   python                            215MiB |
|    1   N/A  N/A     65070      C   python                              0MiB |
|    1   N/A  N/A     67892      C   python                              0MiB |
|    2   N/A  N/A     63296      C   python                              0MiB |
|    2   N/A  N/A     65070      C   python                              0MiB |
|    2   N/A  N/A     67892      C   python                            159MiB |
|    3   N/A  N/A     63296      C   python                              0MiB |
|    3   N/A  N/A     65070      C   python                              0MiB |
|    3   N/A  N/A     67892      C   python                              0MiB |
+-----------------------------------------------------------------------------+

Thank you in advance!

Best regards,
Gabor

Hi @gabor,

Just to give you an update: we have looked into this and we think we know what is going on. Unfortunately, I cannot give you an exact timeline for when this will be fixed, but it will be addressed in a future release.

–Harris

A fix for this is included in v3.2, released March 29, 2021. Release notes available here: https://cryosparc.com/updates