CryoSPARC v4.6.0 2D job never finishes

Hi,
I’m running 2D classification jobs using CryoSPARC v4.6.0, and the jobs never finish. They freeze in the last iteration without any error. I have tried many times: sometimes a job finishes, sometimes it just gets stuck there.

[image]

Thanks,
Wendy

Hi @Wendy,

Could you please try the instructions posted in this thread?

Hi spunjani,

We have tried this, and I still get the problem. It seems to happen at random.

Thanks,
Wendy

Just want to chime in to say we are seeing the same problem, even with transparent hugepages disabled - some 2D classification jobs will finish without issues, while most will stall at a seemingly random time, which is a behavior we’re only seeing since the update to 4.6.
The hardware this occurs on is both Zen2 and Zen3 arch CPUs (EPYC 7702P, EPYC 7713) and both Nvidia V100S and Nvidia A100.
It also seems to be independent of whether SSD caching is switched on or off.

Are there more details we can provide to help you narrow down the issue?

Kind Regards
René

Hi @wendy and @sittr,

I’d like to be absolutely sure that THP isn’t the culprit, since several other users (including us) have confirmed that this problem can be alleviated by disabling THP. Of course it’s possible there is a separate issue at play, but I’d like to be certain. Note that unless additional steps are taken, the THP enablement setting does not survive a reboot.

To that end, please run the following command on your worker nodes, and make sure that the worker node you check the setting on is the same one that you queue the test jobs to.

cat /sys/kernel/mm/transparent_hugepage/enabled

If transparent hugepages are disabled, the output will be:

always madvise [never]

If the output is either of the following, THP is not fully disabled:

always [madvise] never

or

[always] madvise never
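As noted above, echoing into that sysfs file does not persist across reboots. One way to make the disabled state permanent (a sketch for GRUB-based systems; a systemd unit or a tuned profile would also work) is to add transparent_hugepage=never to the kernel command line:

# append the parameter to the kernel command line in /etc/default/grub, e.g.:
#   GRUB_CMDLINE_LINUX="... transparent_hugepage=never"
# then regenerate the GRUB config and reboot:
sudo update-grub    # Debian/Ubuntu; on RHEL-family systems: grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot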

Hi @hsnyder,
I can confirm that hugepages are disabled on all worker nodes of our system (hosts with ‘no route’ and ‘exit code 255’ are currently down for power saving):

# pdsh -g compute "cat /sys/kernel/mm/transparent_hugepage/enabled"
gpunode03: always madvise [never]
gpunode02: always madvise [never]
gpunode01: always madvise [never]
gpunode13: always madvise [never]
gpunode15: always madvise [never]
gpunode06: always madvise [never]
gpunode14: always madvise [never]
gpunode07: always madvise [never]
gpunode21: always madvise [never]
gpunode10: always madvise [never]
gpunode09: always madvise [never]
gpunode04: always madvise [never]
gpunode17: always madvise [never]
gpunode18: always madvise [never]
gpunode19: always madvise [never]
gpunode26: always madvise [never]
gpunode24: always madvise [never]
gpunode25: always madvise [never]
gpunode27: always madvise [never]
gpunode23: always madvise [never]
gpunode28: always madvise [never]
gpunode31: always madvise [never]
gpunode30: always madvise [never]
gpunode35: always madvise [never]
hmemnode05: always madvise [never]
hmemnode07: always madvise [never]
hmemnode08: always madvise [never]
gpunode37: always madvise [never]
gpunode38: always madvise [never]
gpunode33: always madvise [never]
hmemnode10: always madvise [never]
gpunode34: always madvise [never]
gpunode11: ssh: connect to host gpunode11 port 22: No route to host
gpunode08: ssh: connect to host gpunode08 port 22: No route to host
gpunode32: ssh: connect to host gpunode32 port 22: No route to host
gpunode22: ssh: connect to host gpunode22 port 22: No route to host
gpunode12: ssh: connect to host gpunode12 port 22: No route to host
gpunode29: ssh: connect to host gpunode29 port 22: No route to host
gpunode05: ssh: connect to host gpunode05 port 22: No route to host
gpunode20: ssh: connect to host gpunode20 port 22: No route to host
gpunode16: ssh: connect to host gpunode16 port 22: No route to host
pdsh@hnode01: gpunode11: ssh exited with exit code 255
pdsh@hnode01: gpunode22: ssh exited with exit code 255
pdsh@hnode01: gpunode32: ssh exited with exit code 255
pdsh@hnode01: gpunode29: ssh exited with exit code 255
pdsh@hnode01: gpunode08: ssh exited with exit code 255
pdsh@hnode01: gpunode16: ssh exited with exit code 255
pdsh@hnode01: gpunode20: ssh exited with exit code 255
pdsh@hnode01: gpunode05: ssh exited with exit code 255
pdsh@hnode01: gpunode12: ssh exited with exit code 255
gpunode36: ssh: connect to host gpunode36 port 22: No route to host
pdsh@hnode01: gpunode36: ssh exited with exit code 255
hmemnode03: ssh: connect to host hmemnode03 port 22: No route to host
hmemnode01: ssh: connect to host hmemnode01 port 22: No route to host
hmemnode02: ssh: connect to host hmemnode02 port 22: No route to host
hmemnode04: ssh: connect to host hmemnode04 port 22: No route to host
hmemnode06: ssh: connect to host hmemnode06 port 22: No route to host
pdsh@hnode01: hmemnode03: ssh exited with exit code 255
pdsh@hnode01: hmemnode02: ssh exited with exit code 255
pdsh@hnode01: hmemnode06: ssh exited with exit code 255
pdsh@hnode01: hmemnode04: ssh exited with exit code 255
pdsh@hnode01: hmemnode01: ssh exited with exit code 255
hmemnode11: ssh: connect to host hmemnode11 port 22: No route to host
hmemnode09: ssh: connect to host hmemnode09 port 22: No route to host
pdsh@hnode01: hmemnode11: ssh exited with exit code 255
pdsh@hnode01: hmemnode09: ssh exited with exit code 255

Additionally, I’ve seen this kind of stalling behaviour mainly in 2D Classification jobs, but there are also currently two 3D Classifications showing the same symptoms on our system. Whether it happens for a given job seems random, although I cannot confirm whether restarting a stalling job has a chance of completing without stalling, or whether certain jobs will reproducibly stall while others don’t.

I’ve scoured the job logs to no avail: there are a few warnings that mostly seem non-critical, and they usually occur before the stalled iteration starts; after that, the log falls silent aside from heartbeat messages. These are the warning messages typically encountered before stalling (some may occur only once, some may repeat many times):

<string>:1: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
<string>:1: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current  use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/util/logsumexp.py:41: RuntimeWarning: divide by zero encountered in log
  return n.log(wa * n.exp(a - vmax) + wb * n.exp(b - vmax) ) + vmax
<string>:1: RuntimeWarning: divide by zero encountered in log
<string>:1: RuntimeWarning: divide by zero encountered in true_divide
<string>:1: RuntimeWarning: invalid value encountered in multiply
<string>:1: RuntimeWarning: invalid value encountered in true_divide
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected

-René

I am also seeing hanging 2D classification jobs with hugepages disabled. I have tried restarting three times now, to no avail. The heartbeat is still sent normally.

Thanks @sittr and @KiSchnelle for confirming that.

Could you both please post some more information about your compute setup? What cluster workload manager are you using (if any), and what operating system?

Also, next time you encounter a stalled job, could you log into the node and use htop (or similar) to check the node activity? Specifically:

  • How much RAM is free
  • Is the stalled job using CPU? How much?
  • Is the kernel using a lot of CPU (red bar on htop)?

When we see a THP related stall, the job itself is basically idle and we see some kernel threads using 100% of several cores (i.e. red bars in htop). If instead we were to see the job itself taking up 100% of a core, or the job stalled with no significant CPU usage at all, those would imply a different kind of problem.

After checking the CPU and memory usage, try sending the stalled job a SIGABRT (kill -SIGABRT [pid]). This should crash the job. Once it’s stopped, check the bottom of the job log (the plain text log) for a traceback and some hexadecimal data. If you can post that, it might help us debug the issue.
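For example, something along these lines (a rough sketch; the PID and the project/job paths will differ on your system):

grep -m1 "MAIN PROCESS PID" /path/to/project/JXX/job.log
kill -SIGABRT <pid>                       # <pid> = the number printed by the line above
tail -n 100 /path/to/project/JXX/job.log  # the traceback and hex data land at the end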

– Harris

Hi Harris,

I confirmed that transparent hugepages are disabled. Some 2D jobs still hang.

I think my computer has enough memory. Before I updated to v4.6.0, I could successfully finish 2D jobs with more than 9 million particles. Now I cannot finish a 2D job even with 400k particles.

Thanks,
Wendy

Thanks for confirming, @wendy. If you’re able to perform the same checks that I mentioned in my previous post, that would be helpful. Specifically:

  • Check the CPU activity when a job stalls
  • Try sending SIGABRT to the stalled job and pasting the resulting log entry

Another thing anyone still experiencing this could try is turning off THP-related memory defragmentation.

As root,

echo never > /sys/kernel/mm/transparent_hugepage/defrag
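To confirm it took effect, read the file back; when defrag is disabled the brackets should sit around never (the exact option list varies by kernel version):

cat /sys/kernel/mm/transparent_hugepage/defrag
# e.g. always defer defer+madvise madvise [never]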

@hsnyder
No problem, happy to help where I can.

I definitely also think it’s a 4.6 bug. I also confirmed with another group that it appeared after the 4.6 update. For me it’s all 2D classification jobs; the iteration number is totally random. So far only one job managed to get to the last iteration before stalling :D

We use Slurm as the workload manager. I now used a node which has 4 TB of RAM and allocated 100 GB. The whole time the job was running, more than enough RAM was free. The OS is Ubuntu 22.04 LTS.

The stalling happens within an iteration. Basically the increasing NUM and time counters just freeze. For example, right now:

[CPU:   6.55 GB]
Start of Iteration 6
[CPU:   6.55 GB]
-- DEV 0 THR 1 NUM 10000 TOTAL 41.969229 ELAPSED 113.89935 --
  • Job shows 0% CPU in htop, no excessive red bars
  • Sending SIGABRT does not kill the job; tried sending the signal both via Slurm and in htop
  • In htop no process seems to show state D.
    Trying some strace (see also the Python-level sketch after this list):
  • main is in wait4(-1,
  • another child is in futex(0x7f55b4255c50, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY
  • another one seems to be polling: pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=50000000}, NULL) = 0 (Timeout)
  • one is restart_syscall(<… resuming interrupted read …>
  • GPUs have some reserved memory (4/48 GB) but are at 0% utilization
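Since it’s a Python process, I could also try to grab a Python-level stack with py-spy (untested here; this assumes py-spy can be installed on the node and run as root):

pip install py-spy
py-spy dump --pid <main_pid>    # prints the Python stack of every thread in the process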

I am now trying with an exclusive node reservation.

cheers
Kilian

Hi @KiSchnelle,

Thanks, that’s very helpful information. Two questions about the strace you did:

  • Are you sure it was the job’s “main process” and not the “monitor process”? The main process PID is printed near the start of the job’s text log, in a line like MAIN PROCESS PID 2247332. (A quick way to check is sketched below.)

  • Are the first arguments to the futex calls all the same long hexadecimal number, or are there several different ones?
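One way to double-check both (a sketch; adjust the log path) is to pull both PIDs out of the job log and attach strace to the main process with -f, so all of its threads are traced at once:

grep -E "MAIN PROCESS PID|MONITOR PROCESS PID" /path/to/project/JXX/job.log
strace -f -p <main_pid>    # -f follows threads, so every futex address shows up together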

Harris

@hsnyder

I didn’t look at that; I just assumed it was the one running cryosparcw and not python.

I may actually have been wrong about transparent_hugepage: the node I checked had it off, but I may have just gotten nodes that had it on.

But the funny thing is that the exclusive job just finished without problems, and that node definitely has it set to madvise and not never. I am trying another 2D job with exclusive on the same node to see if I just got lucky. Edit1: The exclusive script also reserves 1.5 TB of RAM :D, so if the second run also finishes normally, I will also have to test in a third run whether the normal RAM reservation is maybe the problem with hugepages enabled.

Also, on a node where I definitely disabled hugepages, I started two 2D jobs to see if one stalls. If it does, I can tell you about the hex; otherwise you were probably right from the start :) Not sure why it would happen for sittr then, though, and I’m still pretty sure it’s something with 4.6.

cheers
Kilian

Hi @KiSchnelle,

No problem, let me know your results.

You might be able to modify your Slurm script to also print out the state of transparent hugepages. That way you can be 100% sure that it is or is not off on a given run.
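For example, something like this near the top of the cluster submission script template (just a sketch; place it before the cryosparcw command so it is captured in the Slurm/job output):

echo "THP enabled on $(hostname): $(cat /sys/kernel/mm/transparent_hugepage/enabled)"
echo "THP defrag  on $(hostname): $(cat /sys/kernel/mm/transparent_hugepage/defrag)"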

THP definitely doesn’t guarantee a lock-up. Sometimes jobs finish normally with it on.

4.6 seems to have made the THP problem worse, but some users, including us, did see it before that. The other reports in this thread imply that there’s a separate bug that also causes hangs, but I don’t yet have much of a lead on what the problem is, if there is indeed a separate problem.

To reiterate, for anyone else running into this issue, please try disabling both THP and defrag on the node where the job runs:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
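On a cluster this can be pushed to all workers at once, for example with pdsh as @sittr did above (untested sketch; the echo has to run as root on each node):

pdsh -g compute 'echo never > /sys/kernel/mm/transparent_hugepage/enabled; echo never > /sys/kernel/mm/transparent_hugepage/defrag'
pdsh -g compute 'cat /sys/kernel/mm/transparent_hugepage/enabled /sys/kernel/mm/transparent_hugepage/defrag'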

Sadly, hugepages weren’t the culprit. Both 2D jobs are already hanging, while the exclusive job on the node where it is set to madvise is still running happily :)

root@bert107:/home/ubuntu# cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise madvise [never]
root@bert107:/home/ubuntu# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

As for the hex addresses, here are some; they are different.

root@bert107:/home/ubuntu# strace -p 985528
strace: Process 985528 attached
futex(0x55f3ef38e180, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY^Cstrace: Process 985528 detached
 <detached ...>

root@bert107:/home/ubuntu# strace -p 985529
strace: Process 985529 attached
futex(0x55f3eab6dd20, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY^Cstrace: Process 985529 detached
 <detached ...>

root@bert107:/home/ubuntu# strace -p 985530
strace: Process 985530 attached
futex(0x55f3e7c44760, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY^Cstrace: Process 985530 detached
 <detached ...>

For the main process I looked up the PID; it just repeats this every few seconds:

strace: Process 984854 attached
pselect6(0, NULL, NULL, NULL, {tv_sec=3, tv_nsec=920133630}, NULL) = 0 (Timeout)
getppid()                               = 984852
fstat(3, {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1, 0x9), ...}) = 0
read(3, "\365j@\33~gg\231\377\334f\370k\2429\374", 16) = 16
newfstatat(AT_FDCWD, "/etc/nsswitch.conf", {st_mode=S_IFREG|0644, st_size=510, ...}, 0) = 0
newfstatat(AT_FDCWD, "/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=923, ...}, 0) = 0
openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 20
newfstatat(20, "", {st_mode=S_IFREG|0644, st_size=575, ...}, AT_EMPTY_PATH) = 0
lseek(20, 0, SEEK_SET)                  = 0
read(20, "# Your system has configured 'ma"..., 4096) = 575
read(20, "", 4096)                      = 0
close(20)                               = 0
socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 20
ioctl(20, FIONBIO, [1])                 = 0
connect(20, {sa_family=AF_INET, sin_port=htons(39002), sin_addr=inet_addr("192.168.120.2")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=20, events=POLLOUT|POLLERR}], 1, 300000) = 1 ([{fd=20, revents=POLLOUT}])
getsockopt(20, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
setsockopt(20, SOL_TCP, TCP_NODELAY, [1], 4) = 0
poll([{fd=20, events=POLLOUT}], 1, 300000) = 1 ([{fd=20, revents=POLLOUT}])
sendto(20, "POST /api HTTP/1.1\r\nAccept-Encod"..., 250, 0, NULL, 0) = 250
poll([{fd=20, events=POLLOUT}], 1, 300000) = 1 ([{fd=20, revents=POLLOUT}])
sendto(20, "{\"jsonrpc\": \"2.0\", \"method\": \"he"..., 115, 0, NULL, 0) = 115
poll([{fd=20, events=POLLIN}], 1, 300000) = 1 ([{fd=20, revents=POLLIN}])
recvfrom(20, "HTTP/1.1 200 OK\r\nServer: gunicor"..., 8192, 0, NULL, NULL) = 221
close(20)                               = 0
write(1, "========= sending heartbeat at 2"..., 58) = 58

Edit1:
Killing the main PID with SIGABRT causes it to become a zombie, with the children still running. The log shows:

Received SIGABRT (addr=00000000000f1092)
/home/cryosparcuser/cryosparc_worker/cryosparc_compute/ioengine/core.so(traceback_signal_handler+0x113)[0x7f64b188e953]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f64ba65b520]
/lib/x86_64-linux-gnu/libc.so.6(__select+0x15d)[0x7f64ba73463d]
python(+0x247baf)[0x55f3e3cb5baf]
/home/cryosparcuser/cryosparc_worker/cryosparc_compute/run.cpython-310-x86_64-linux-gnu.so(+0xd16d)[0x7f64b9e8016d]
/home/cryosparcuser/cryosparc_worker/cryosparc_compute/run.cpython-310-x86_64-linux-gnu.so(+0x1594e)[0x7f64b9e8894e]
python(_PyEval_EvalFrameDefault+0x4c12)[0x55f3e3ba7142]
python(+0x1d7c60)[0x55f3e3c45c60]
python(PyEval_EvalCode+0x87)[0x55f3e3c45ba7]
python(+0x20812a)[0x55f3e3c7612a]
python(+0x203523)[0x55f3e3c71523]
python(PyRun_StringFlags+0x7d)[0x55f3e3c6991d]
python(PyRun_SimpleStringFlags+0x3c)[0x55f3e3c6975c]
python(Py_RunMain+0x26b)[0x55f3e3c6866b]
python(Py_BytesMain+0x37)[0x55f3e3c391f7]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f64ba642d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f64ba642e40]
python(+0x1cb0f1)[0x55f3e3c390f1]
rax fffffffffffffffc  rbx 00007ffee4fcdb00  rcx 00007f64ba73463d  rdx 0000000000000000  
rsi 0000000000000000  rdi 0000000000000000  rbp 0000000000000000  rsp 00007ffee4fcda80  
r8  00007ffee4fcda90  r9  0000000000000000  r10 0000000000000000  r11 0000000000000246  
r12 0000000000000000  r13 00007ffee4fcda90  r14 0000000000000000  r15 0000000000000000  
5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 84 00 00 00 00 00 4c 89 54 24 08 4c 89 04 24 e8
d2 53 f7 ff 4c 8b 04 24 45 31 c9 4c 89 fa 41 89 c6 4c 8b 54 24 08 48 89 ee 44 89 e7 b8
0e 01 00 00 0f 05
-->   48 3d 00 f0 ff ff 77 5b 41 89 c4 44 89 f7 e8 10 54 f7 ff e9 57 ff ff ff 0f 1f 00
45 31 ed 45 31 c0 e9 17 ff ff ff 0f 1f 44 00 00 69 c0 40 42 0f 00 49 8d 0c 30 29 c2 69
fa e8 03 00 00 e9 e9 fe

cheers
Kilian

Hi @KiSchnelle,

Hmmm, those traces look very much like they’re from the monitor process and not the main process… the monitor process is the one that sends the heartbeats. Could you paste the first 20 lines or so of the job text log here? Everything before the line

***************************************************************

Thanks

Hello,
looking at a stalled 3D Classification job right now:

  • Job has reserved 96 GB RAM and is using 3.7 GB, so over 92 GB are still free.
  • Generally 0% CPU usage. There are short periodic flashes where one process shows 0.7% usage; I presume that is when the heartbeat happens.
  • No intensive kernel CPU usage.

The job log shows no info upon a SIGABRT besides “no heartbeat received in 180 seconds”. The stderr output logs “aborted (core dumped)”; core dumping is switched off on our cluster, so I currently do not get a .core file.
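An alternative that wouldn’t depend on the cluster’s core-dump settings might be to attach gdb to the still-running stalled process (a sketch, assuming gdb is available on the node):

gdb -p <main_pid> -batch -ex 'thread apply all bt'   # native stack of every thread, non-destructive
gcore -o /tmp/stalled_job <main_pid>                 # or write a core file while the job keeps running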

@hsnyder you were right, sorry, it was late in the evening and I mixed up monitor and main :D

The second exclusive job also finished without a problem. I then started two 2D jobs again on the hugepages-disabled node, increasing the memory of one of them to 250 GB, but both are hanging again. I am now trying a third exclusive run, though I am not sure what would be different then.
Here is the hopefully now correct output for one of them :)

root@bert107:/home/ubuntu# more /sbdata/projects/XXX/job.log
ERROR: ld.so: object '/home/cryosparcuser/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libpython3.10.so' 
from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object '/home/cryosparcuser/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libpython3.10.so' 
from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object '/home/cryosparcuser/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libpython3.10.so' 
from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object '/home/cryosparcuser/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libpython3.10.so' 
from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object '/home/cryosparcuser/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libpython3.10.so' 
from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object '/home/cryosparcuser/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libpython3.10.so' 
from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object '/home/cryosparcuser/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libpython3.10.so' 
from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.


================= CRYOSPARCW =======  2024-10-01 20:38:56.690192  =========
Project P86 Job J1522
Master telly102.maas Port 39002
===========================================================================
MAIN PROCESS PID 988583
========= now starting main process at 2024-10-01 20:38:56.691068
class2D.newrun cryosparc_compute.jobs.jobregister
/home/cryosparcuser/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getl
imits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/cryosparcuser/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getl
imits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/cryosparcuser/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getl
imits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/cryosparcuser/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getl
imits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
MONITOR PROCESS PID 988585
========= monitor process now waiting for main process
========= sending heartbeat at 2024-10-01 20:38:57.996579
========= sending heartbeat at 2024-10-01 20:39:08.016903
========= sending heartbeat at 2024-10-01 20:39:18.035236
***************************************************************

and strace:

root@bert107:/home/ubuntu# strace -p 988583
strace: Process 988583 attached
futex(0x7fab001e4070, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY^Cstrace: Process 988583 detached
 <detached ...

After killing (kill -SIGABRT 988583; now all children die):

Received SIGABRT (addr=00000000000f47ff)
/home/cryosparcuser/cryosparc_worker/cryosparc_compute/ioengine/core.so(traceback_signal_handler+0x113)[0x7faba30c3953]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fabac00a520]
/lib/x86_64-linux-gnu/libc.so.6(+0x91117)[0x7fabac059117]
/lib/x86_64-linux-gnu/libc.so.6(+0x9cc78)[0x7fabac064c78]
python(PyThread_acquire_lock_timed+0xc9)[0x55fd918c85d9]
python(+0x1de981)[0x55fd9198b981]
python(+0x1de79b)[0x55fd9198b79b]
python(+0x145ddd)[0x55fd918f2ddd]
python(_PyEval_EvalFrameDefault+0x72c)[0x55fd918e1c5c]
python(_PyFunction_Vectorcall+0x6c)[0x55fd918f1a2c]
python(_PyEval_EvalFrameDefault+0x72c)[0x55fd918e1c5c]
python(_PyFunction_Vectorcall+0x6c)[0x55fd918f1a2c]
/home/cryosparcuser/cryosparc_worker/cryosparc_compute/jobs/class2D/newrun.cpython-310-x86_64-linux-gnu.so(+0x43f0b)[0x7faba16b2f0b]
/home/cryosparcuser/cryosparc_worker/cryosparc_compute/run.cpython-310-x86_64-linux-gnu.so(+0xcd14)[0x7fabab82ed14]
/home/cryosparcuser/cryosparc_worker/cryosparc_compute/jobs/class2D/newrun.cpython-310-x86_64-linux-gnu.so(+0x160ce)[0x7faba16850ce]
/home/cryosparcuser/cryosparc_worker/cryosparc_compute/jobs/class2D/newrun.cpython-310-x86_64-linux-gnu.so(+0xa5ed8)[0x7faba1714ed8]
/home/cryosparcuser/cryosparc_worker/cryosparc_compute/run.cpython-310-x86_64-linux-gnu.so(+0x20e91)[0x7fabab842e91]
/home/cryosparcuser/cryosparc_worker/cryosparc_compute/run.cpython-310-x86_64-linux-gnu.so(+0x12c31)[0x7fabab834c31]
python(_PyEval_EvalFrameDefault+0x4c12)[0x55fd918e6142]
python(+0x1d7c60)[0x55fd91984c60]
python(PyEval_EvalCode+0x87)[0x55fd91984ba7]
python(+0x20812a)[0x55fd919b512a]
python(+0x203523)[0x55fd919b0523]
python(PyRun_StringFlags+0x7d)[0x55fd919a891d]
python(PyRun_SimpleStringFlags+0x3c)[0x55fd919a875c]
python(Py_RunMain+0x26b)[0x55fd919a766b]
python(Py_BytesMain+0x37)[0x55fd919781f7]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fababff1d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fababff1e40]
python(+0x1cb0f1)[0x55fd919780f1]
rax fffffffffffffffc  rbx 00007fab001e4070  rcx 00007fabac059117  rdx 0000000000000000  
rsi 0000000000000189  rdi 00007fab001e4070  rbp 0000000000000000  rsp 00007ffeec14e580  
r8  0000000000000000  r9  00000000ffffffff  r10 0000000000000000  r11 0000000000000246  
r12 0000000000000000  r13 0000000000000000  r14 0000000000000000  r15 0000000000000000  
00 00 5b 41 5c 41 5d c3 90 48 89 7c 24 10 89 74 24 0c 48 89 4c 24 18 e8 fd f8 ff ff 4c
8b 54 24 18 45 31 c0 44 89 ea 41 89 c4 8b 74 24 0c 48 8b 7c 24 10 41 b9 ff ff ff ff b8
ca 00 00 00 0f 05
-->   44 89 e7 48 89 c3 e8 3e f9 ff ff e9 62 ff ff ff 66 0f 1f 84 00 00 00 00 00 48 83
39 00 0f 89 0f ff ff ff b8 6e 00 00 00 e9 78 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00
90 f3 0f 1e fa 41 89 f0

The third exclusive job also finished without a problem for me.

So, so far for me:

  • only 2D is affected
  • turning off hugepages has no effect
  • running exclusively on a node with hugepages set to madvise finished 3 times in a row with no problem
    - the exclusive job has more RAM allocated, so I tried that on a normal job; it didn’t help
    - the exclusive job takes 48 cpus-per-task instead of the normal calculation (see the sketch after this list), which is normally:
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task={{ num_cpu }} (2 for 2D with 2 GPUs)
  • every non-exclusive job (with other cryosparc jobs running on the node) has stalled so far
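For reference, the exclusive submissions look roughly like this (a sketch; the exact values come from our Slurm template):

#SBATCH --exclusive
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=48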

I am now trying an exclusive job with the normal CPU calculation.

Edit1: The exclusive job with the normal CPU calculation also finished normally. I then took exactly the same job that had failed multiple times before in non-exclusive mode and ran it on the same node with exclusive, and it finished totally normally. So for me it somehow looks like a problem that appears when multiple cryosparc jobs are running on the same node, but only 2D is affected. Let me know if you have any other ideas for tests I can try. Maybe one of the others could confirm by running their hanging jobs exclusively.