3D Flex Training Frozen

Hello, I’m currently running 3D Flex training on an H800. However, the process has been stuck at iteration 7 for the past two days without producing any errors. I’m unsure about the next steps and would greatly appreciate your assistance.

Welcome to the forum @fxminato.
What is H800?
Please can you collect the following information on the worker computer where the job is running and post it here (a small collection sketch follows the list):

  • the output of the command htop
  • the output of the command nvidia-smi -L
  • the output of the command free -g
  • the output of the command cat /sys/kernel/mm/transparent_hugepage/enabled
  • the output of the command uname -a
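
For convenience, something like the snippet below could gather all of those outputs into a single file to paste back. This is just a convenience sketch using standard tools; since htop is interactive, top’s batch mode stands in for it here, and the output file name diag.txt is arbitrary.

    # collect_diag.sh: gather the requested diagnostics into one file
    {
      echo "== top (batch) ==";  top -bn1 | head -n 40
      echo "== nvidia-smi ==";   nvidia-smi -L
      echo "== free ==";         free -g
      echo "== THP ==";          cat /sys/kernel/mm/transparent_hugepage/enabled
      echo "== uname ==";        uname -a
    } > diag.txt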

Thank you for your response. H800 refers to the NVIDIA H800 SXM5. Our 3D Flex training job is P2 J20, and although it may appear to be running smoothly, as seen below, it has actually been stuck for more than a day.

bcl@bimsa-SYS-421GE-TNRT:~$ htop
 Avg[||||||||||||||||||                                                          20.6%]   Tasks: 148, 1025 thr; 34 running
  Mem[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||13.9G/504G]   Load average: 34.37 33.63 33.28 
  Swp[|                                                                     7.50M/2.00G]   Uptime: 1 day, 02:35:08

    PID USER      PRI  NI  VIRT   RES   SHR S CPU%▽MEM%   TIME+  Command
   4720 bcl        20   0 66.2G 5787M  448M S 3201  1.1     840h python -c import cryosparc_compute.run as run; run.run() --project P2 --job J20 --master_hostname bimsa-SYS-421GE-TNR
   5189 bcl        20   0 66.2G 5787M  448M R 100.  1.1 26h15:51 python -c import cryosparc_compute.run as run; run.run() --project P2 --job J20 --master_hostname bimsa-SYS-421GE-TNR
   5198 bcl        20   0 66.2G 5787M  448M R 100.  1.1 26h15:51 python -c import cryosparc_compute.run as run; run.run() --project P2 --job J20 --master_hostname bimsa-SYS-421GE-TNR
   5219 bcl        20   0 66.2G 5787M  448M R 100.  1.1 26h15:51 python -c import cryosparc_compute.run as run; run.run() --project P2 --job J20 --master_hostname bimsa-SYS-421GE-TNR
   5240 bcl        20   0 66.2G 5787M  448M R 100.  1.1 26h15:51 python -c import cryosparc_compute.run as run; run.run() --project P2 --job J20 --master_hostname bimsa-SYS-421GE-TNR
   5246 bcl        20   0 66.2G 5787M  448M R 100.  1.1 26h15:51 python -c import cryosparc_compute.run as run; run.run() --project P2 --job J20 --master_hostname bimsa-SYS-421GE-TNR
   5248 bcl        20   0 66.2G 5787M  448M R 100.  1.1 26h15:50 python -c import cryosparc_compute.run as run; run.run() --project P2 --job J20 --master_hostname bimsa-SYS-421GE-TNR
   5259 bcl        20   0 66.2G 5787M  448M R 100.  1.1 26h15:51 python -c import cryosparc_compute.run as run; run.run() --project P2 --job J20 --master_hostname bimsa-SYS-421GE-TNR
   5140 bcl        20   0 66.2G 5787M  448M R 100.  1.1 26h15:51 python -c import cryosparc_compute.run as run; run.run() --project P2 --job J20 --master_hostname bimsa-SYS-421GE-TNR
   5143 bcl        20   0 66.2G 5787M  448M R 100.  1.1 26h15:51 python -c import cryosparc_compute.run as run; run.run() --project P2 --job J20 --master_hostname bimsa-SYS-421GE-TNR
   5155 bcl        20   0 66.2G 5787M  448M R 100.  1.1 26h15:51 python -c import cryosparc_compute.run as run; run.run() --project P2 --job J20 --master_hostname bimsa-SYS-421GE-TNR
   5166 bcl        20   0 66.2G 5787M  448M R 100.  1.1 26h15:51 python -c import cryosparc_compute.run as run; run.run() --project P2 --job J20 --master_hostname bimsa-SYS-421GE-TNR
   5173 bcl        20   0 66.2G 5787M  448M R 100.  1.1 26h15:51 python -c import cryosparc_compute.run as run; run.run() --project P2 --job J20 --master_hostname bimsa-SYS-421GE-TNR
   5179 bcl        20   0 66.2G 5787M  448M R 100.  1.1 26h15:50 python -c import cryosparc_compute.run as run; run.run() --project P2 --job J20 --master_hostname bimsa-SYS-421GE-TNR
   5183 bcl        20   0 66.2G 5787M  448M R 100.  1.1 26h15:51 python -c import cryosparc_compute.run as run; run.run() --project P2 --job J20 --master_hostname bimsa-SYS-421GE-TNR
F1Help  F2Setup F3SearchF4FilterF5Tree  F6SortByF7Nice -F8Nice +F9Kill  F10Quit


bcl@bimsa-SYS-421GE-TNRT:~$ nvidia-smi -L
GPU 0: NVIDIA H800 PCIe (UUID: GPU-8d55c01a-e897-ba49-47d0-b9dc2439a35b)
GPU 1: NVIDIA H800 PCIe (UUID: GPU-2361418d-0088-744e-341a-e6d6f669f1c0)
GPU 2: NVIDIA H800 PCIe (UUID: GPU-3a965ccf-016e-76a3-7ca3-dbb7ab817e39)
GPU 3: NVIDIA H800 PCIe (UUID: GPU-d7246aa3-6cd5-89ca-cff8-27f5751a2155)
GPU 4: NVIDIA H800 PCIe (UUID: GPU-1c77d065-aa3a-b7cb-97b8-31f14d2e9fca)
GPU 5: NVIDIA H800 PCIe (UUID: GPU-38614431-822e-bf14-9e5b-2f4c1310087d)
GPU 6: NVIDIA H800 PCIe (UUID: GPU-ef3fd2e8-0062-7c2a-329a-1fb08f31f2a3)
GPU 7: NVIDIA H800 PCIe (UUID: GPU-860e9bfc-e017-ae66-2ae2-c0da1f388254)

bcl@bimsa-SYS-421GE-TNRT:~$ free -g
               total        used        free      shared  buff/cache   available
Mem:             503          12           2           0         488         487
Swap:              1           0           1

bcl@bimsa-SYS-421GE-TNRT:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

bcl@bimsa-SYS-421GE-TNRT:~$ uname -a
Linux bimsa-SYS-421GE-TNRT 6.2.0-36-generic #37~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct  9 15:34:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Thanks. Your output shows transparent hugepages (THP) are currently in madvise mode (the value in brackets is the active setting). Please can you try whether the command

sudo sh -c "echo never >/sys/kernel/mm/transparent_hugepage/enabled"

“revives” the job? (Background: THP housekeeping has been observed to stall long-running compute jobs.)
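
Note that writing to /sys does not survive a reboot. If disabling THP helps, one way to make the setting persistent on an Ubuntu system like this one is via the kernel command line; a minimal sketch, assuming a stock GRUB-based Ubuntu boot (preserve any options already present in the file):

    # /etc/default/grub: append the THP setting to the kernel command line
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash transparent_hugepage=never"

    # regenerate the GRUB configuration, then reboot
    sudo update-grub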

Hi, we have updated CryoSPARC to v4.4 and the problem has been resolved :smile: Thank you for replying!