Test Worker GPUs - Job process terminated abnormally

Hi, I am new to CryoSPARC. I followed the instructions carefully, but when I run “Test Worker GPUs” I get an error. Here is the job.log:

================= CRYOSPARCW =======  2024-11-25 01:23:14.899633  =========
Project P4 Job J21
Master 1pp1o48b1qpqq-0 Port 39002
===========================================================================
MAIN PROCESS PID 1704
========= now starting main process at 2024-11-25 01:23:14.900383
instance_testing.run cryosparc_compute.jobs.jobregister
MONITOR PROCESS PID 1706
========= monitor process now waiting for main process
========= sending heartbeat at 2024-11-25 01:23:16.429488
========= sending heartbeat at 2024-11-25 01:23:26.453519
========= sending heartbeat at 2024-11-25 01:23:36.477309
========= sending heartbeat at 2024-11-25 01:23:46.496725
========= sending heartbeat at 2024-11-25 01:23:56.516580
========= sending heartbeat at 2024-11-25 01:24:06.535762
========= sending heartbeat at 2024-11-25 01:24:16.553652
========= sending heartbeat at 2024-11-25 01:24:26.577823
========= sending heartbeat at 2024-11-25 01:24:36.596749
========= sending heartbeat at 2024-11-25 01:24:46.615605
gpu_partition - [InitCudaPidMap] warn:recv response msg len mismatch, rcv_len = -1, errno = 11
gpu_partition - [BackTraceStack] warn:/usr/local/lib/inais/libgpu_partition.so(+0x13b1b) [0x7efc6cdc8b1b]
gpu_partition - [BackTraceStack] warn:/usr/local/lib/inais/libgpu_partition.so(cuDevicePrimaryCtxRetain+0x13b) [0x7efc6cdc952f]
gpu_partition - [BackTraceStack] warn:/home/cryo/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/cuda/_cuda/ccuda.cpython-310-x86_64-linux-gnu.so(+0x46ea3) [0x7efc4a6e9ea3]
gpu_partition - [BackTraceStack] warn:/home/cryo/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/cuda/ccuda.cpython-310-x86_64-linux-gnu.so(+0x149ea) [0x7efc4a7239ea]
gpu_partition - [BackTraceStack] warn:/home/cryo/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/cuda/cuda.cpython-310-x86_64-linux-gnu.so(+0x2873ef) [0x7efc4aa163ef]
Total: 91.579s
  MAIN THREAD:

========= main process now complete at 2024-11-25 01:24:56.634688.
========= monitor process now complete at 2024-11-25 01:24:56.678294.

Can someone advise on what I should do?

@Wanl Please can you post the output of the command

cryosparcm cli "get_job('P4', 'J21', 'job_type', 'version', 'instance_information', 'status',  'params_spec', 'errors_run')"

@wtempel I’m glad to see your reply :blush:. Here is the command’s output.

root@1pp1o48b1qpqq-0:~# cryosparcm cli "get_job('P4', 'J21', 'job_type', 'version', 'instance_information', 'status',  'params_spec', 'errors_run')"
{'_id': '674360e5e802c455d270a8c0', 'errors_run': [{'message': 'Job process terminated abnormally.', 'warning': False}], 'instance_information': {'CUDA_version': '11.8', 'available_memory': '1.41TB', 'cpu_model': 'Intel(R) Xeon(R) Platinum 8480+', 'driver_version': '12.2', 'gpu_info': [{'id': 0, 'mem': 84987740160, 'name': 'NVIDIA A800 80GB PCIe MIG 7g.80gb', 'pcie': '0000:89:00'}], 'ofd_hard_limit': 1048576, 'ofd_soft_limit': 1024, 'physical_cores': 112, 'platform_architecture': 'x86_64', 'platform_node': '1pp1o48b1qpqq-0', 'platform_release': '5.15.0-60-generic', 'platform_version': '#66-Ubuntu SMP Fri Jan 20 14:29:49 UTC 2023', 'total_memory': '1.48TB', 'used_memory': '63.73GB'}, 'job_type': 'worker_gpu_test', 'params_spec': {'test_pytorch': {'value': True}, 'test_tensorflow': {'value': True}}, 'project_uid': 'P4', 'status': 'failed', 'uid': 'J21', 'version': 'v4.6.2'}

We looked into this on the cluster and suspect a CUDA version mismatch is causing the problem with the libgpu_partition.so file: the driver on the cluster reports CUDA 12.2, while the CryoSPARC worker uses CUDA 11.8. Would changing the CUDA version used by CryoSPARC cause problems?
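To double-check what each side reports, I assume commands along these lines would work (the cryosparcw call helper and the bundled cuda Python package are assumptions based on the install paths in the traceback above; please correct me if the worker environment differs):

# maximum CUDA version supported by the installed driver, plus GPU/MIG inventory
nvidia-smi --query-gpu=driver_version,name --format=csv,noheader
nvidia-smi -L
# CUDA toolkit version the worker environment was built against
/home/cryo/cryosparc_worker/bin/cryosparcw call python -c "import cuda; print(cuda.__version__)"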

@Wanl Please can you post the outputs of these commands

uname -a
cryosparcm cli "get_scheduler_targets()"
cat /proc/1/sched | head -n 1
nvidia-smi
cryosparcm eventlog P4 J21 | tail -n 40

@wtempel Sure!

root@23ha96s68f61e-0:/wanl  uname -a
Linux 23ha96s68f61e-0 5.15.0-60-generic #66-Ubuntu SMP Fri Jan 20 14:29:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
root@23ha96s68f61e-0:/wanl  cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/home/cryo/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 84987740160, 'name': 'NVIDIA A800 80GB PCIe MIG 7g.80gb'}], 'hostname': '23ha96s68f61e-0', 'lane': 'default', 'monitor_port': None, 'name': '23ha96s68f61e-0', 'resource_fixed': {'SSD': True}, 
'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223], 
'GPU': [0], 
'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192]},
'ssh_str': 'root@23ha96s68f61e-0', 'title': 'Worker node 23ha96s68f61e-0', 'type': 'node', 'worker_bin_path': '/home/cryo/cryosparc_worker/bin/cryosparcw'}]

root@23ha96s68f61e-0:/wanl  cat /proc/1/sched | head -n 1
tini (1, #threads: 1)
root@23ha96s68f61e-0:/wanl  nvidia-smi
Wed Nov 27 10:41:05 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A800 80GB PCIe          On  | 00000000:89:00.0 Off |                   On |
| N/A   30C    P0              44W / 300W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    0   0   0  |               1MiB / 81050MiB  | 98      0 |  7   0    5    1    1 |
|                  |               1MiB / 131072MiB |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

root@23ha96s68f61e-0:/wanl   cryosparcm eventlog P4 J21 | tail -n 40
[Wed, 27 Nov 2024 02:29:59 GMT]  License is valid.
[Wed, 27 Nov 2024 02:29:59 GMT]  Launching job on lane default target 23ha96s68f61e-0 ...
[Wed, 27 Nov 2024 02:29:59 GMT]  Running job on master node hostname 23ha96s68f61e-0
[Wed, 27 Nov 2024 02:30:01 GMT] [CPU RAM used: 92 MB] Job J21 Started
[Wed, 27 Nov 2024 02:30:01 GMT] [CPU RAM used: 92 MB] Master running v4.5.3, worker running v4.5.3
[Wed, 27 Nov 2024 02:30:01 GMT] [CPU RAM used: 92 MB] Working in directory: /home/cryo/cryosparc_database/CS-p4/J21
[Wed, 27 Nov 2024 02:30:01 GMT] [CPU RAM used: 92 MB] Running on lane default
[Wed, 27 Nov 2024 02:30:01 GMT] [CPU RAM used: 92 MB] Resources allocated:
[Wed, 27 Nov 2024 02:30:01 GMT] [CPU RAM used: 92 MB]   Worker:  23ha96s68f61e-0
[Wed, 27 Nov 2024 02:30:01 GMT] [CPU RAM used: 92 MB]   CPU   :  [0]
[Wed, 27 Nov 2024 02:30:01 GMT] [CPU RAM used: 92 MB]   GPU   :  [0]
[Wed, 27 Nov 2024 02:30:01 GMT] [CPU RAM used: 92 MB]   RAM   :  [0]
[Wed, 27 Nov 2024 02:30:01 GMT] [CPU RAM used: 92 MB]   SSD   :  True
[Wed, 27 Nov 2024 02:30:01 GMT] [CPU RAM used: 92 MB] --------------------------------------------------------------
[Wed, 27 Nov 2024 02:30:01 GMT] [CPU RAM used: 92 MB] Importing job module for job type worker_gpu_test...
[Wed, 27 Nov 2024 02:30:05 GMT] [CPU RAM used: 223 MB] Job ready to run
[Wed, 27 Nov 2024 02:30:05 GMT] [CPU RAM used: 223 MB] ***************************************************************
[Wed, 27 Nov 2024 02:30:05 GMT] [CPU RAM used: 254 MB] Obtaining GPU info via `nvidia-smi`...
[Wed, 27 Nov 2024 02:30:05 GMT] [CPU RAM used: 254 MB] NVIDIA A800 80GB PCIe @ 00000000:89:00.0
[Wed, 27 Nov 2024 02:30:05 GMT] [CPU RAM used: 254 MB]     driver_version                :535.104.12
[Wed, 27 Nov 2024 02:30:05 GMT] [CPU RAM used: 254 MB]     persistence_mode              :Enabled
[Wed, 27 Nov 2024 02:30:05 GMT] [CPU RAM used: 254 MB]     power_limit                   :300.00
[Wed, 27 Nov 2024 02:30:05 GMT] [CPU RAM used: 254 MB]     sw_power_limit                :Not Active
[Wed, 27 Nov 2024 02:30:05 GMT] [CPU RAM used: 254 MB]     hw_power_limit                :Not Active
[Wed, 27 Nov 2024 02:30:05 GMT] [CPU RAM used: 254 MB]     compute_mode                  :Default
[Wed, 27 Nov 2024 02:30:05 GMT] [CPU RAM used: 254 MB]     max_pcie_link_gen             :4
[Wed, 27 Nov 2024 02:30:05 GMT] [CPU RAM used: 254 MB]     current_pcie_link_gen         :4
[Wed, 27 Nov 2024 02:30:05 GMT] [CPU RAM used: 254 MB]     temperature                   :30
[Wed, 27 Nov 2024 02:30:05 GMT] [CPU RAM used: 254 MB]     gpu_utilization               :[N/A]
[Wed, 27 Nov 2024 02:30:05 GMT] [CPU RAM used: 254 MB]     memory_utilization            :[N/A]
[Wed, 27 Nov 2024 02:31:42 GMT] [CPU RAM used: 171 MB] ====== Job process terminated abnormally.

@wtempel Hi, this problem is still troubling me. Do you have any suggestions for me?

@Wanl This configuration seems to involve a MIG-enabled GPU partition inside a container. We have not tested such a configuration ourselves and unfortunately cannot help with troubleshooting it. Since the only configured MIG instance appears to encompass the entire device, it may be feasible to test whether the device is supported after disabling MIG.
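For reference, if the host configuration can be changed, disabling MIG is typically done on the host rather than inside the container, roughly along these lines (a general nvidia-smi sketch, not a CryoSPARC-specific procedure; it assumes nothing is currently using the GPU):

# destroy the existing compute and GPU instances first
sudo nvidia-smi mig -dci -i 0
sudo nvidia-smi mig -dgi -i 0
# disable MIG mode; a GPU reset or reboot may be needed for the change to take effect
sudo nvidia-smi -i 0 -mig 0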

Thank you for your suggestion! It seems this issue is caused by MIG mode, but we cannot disable MIG on our cluster at the moment. If we make any progress on the issue in the future, we will post updates in this thread.