Hi @Connor, this does sound like a memory issue, possibly due to box size, movie frame count, or both. Some Linux commands that may help confirm that:
sudo journalctl | grep -i OOM
sudo dmesg | grep -i OOM
OOM refers to “out of memory” - when the system runs out of memory, the Linux kernel's OOM killer terminates one or more processes to free some up. If you see OOM log entries from around the time your job failed, you can be pretty sure that’s what happened.
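One caveat with a plain `grep -i OOM` is that it also matches unrelated words such as "room" or "zoom". Here is a rough sketch of a narrower filter; the function name `oom_lines` is just something I made up, and the pattern is my assumption about the usual kernel OOM-killer wording, so adjust it if your logs look different:

```shell
# Hypothetical helper: keep only lines that look like kernel OOM-killer messages.
# The pattern is an assumption based on typical kernel wording, e.g.
#   "python invoked oom-killer: ..."
#   "Out of memory: Killed process 1234 (python) ..."
oom_lines() {
    grep -iE 'out of memory|oom[-_]kill|oom_reaper'
}

# Typical usage (usually needs root):
#   sudo dmesg -T | oom_lines                              # -T prints readable timestamps
#   sudo journalctl -k --since "2 days ago" | oom_lines    # -k limits output to kernel messages
```

The timestamps make it easier to line the entries up against the time your job failed.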
But I also noticed that the CryoSPARC log reports less than 250 GB of RAM available when the job starts, while you say the computer has 512 GB in total, which suggests that something else is running on your server at the same time. I’d recommend:
- check that no other processes are running
- set the oversubscription threshold high (as suggested by @ccgauvin94)
- reduce the size of the RAM cache to 20 GB or so
- use fewer GPUs
and see what happens.
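For the first item, something like the following could show how much memory is actually free before you launch the job and which processes are holding the most. It is only a sketch: `mem_available_gib` and `top_mem` are names I invented, and it assumes a Linux system with `/proc/meminfo` and the procps version of `ps`.

```shell
# Hypothetical helper: print MemAvailable in GiB.
# Reads /proc/meminfo by default; an alternative file can be passed as $1.
mem_available_gib() {
    # /proc/meminfo reports values in kB; 1048576 kB = 1 GiB
    awk '/^MemAvailable:/ { printf "%.1f\n", $2 / 1048576 }' "${1:-/proc/meminfo}"
}

# Hypothetical helper: list the ten processes using the most resident memory.
top_mem() {
    ps -eo pid,user,rss,comm --sort=-rss | head -n 11   # header line + 10 processes
}

# Typical usage, run just before starting the job:
#   mem_available_gib
#   top_mem
```

If the first number is far below 512 with no CryoSPARC job running, the `top_mem` output should point at whatever else is using the RAM.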