Thanks for your response. I installed the newest drivers and switched to the newest CUDA version. It ran okay for a while, but occasionally it “chokes”.
Output of lscpu && free -g && uname -a:
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 32
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 23
Model name: AMD Ryzen Threadripper 2990WX 32-Core Processor
CPU MHz: 1919.792
CPU max MHz: 3000.0000
CPU min MHz: 2200.0000
L1d cache: 32K
L1i cache: 64K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7,32-39
NUMA node1 CPU(s): 16-23,48-55
NUMA node2 CPU(s): 8-15,40-47
NUMA node3 CPU(s): 24-31,56-63
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
              total        used        free      shared  buff/cache   available
Mem:            125          14           3           3         107         106
Swap:             3           3           0
Linux jptitan 4.18.0-193.19.1.el8_2.x86_64 #1 SMP Mon Sep 14 14:37:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
I also ran your script and checked with htop:
I don’t quite understand why there are 3 processes associated with this, as the job is running on just 1 GPU (GPU 0).
I also retrieved the job log, as you suggested.
I made a screenshot, since I cannot seem to find the button to attach text files:
This is very curious to me, since the job is running right now, as we speak. I am not sure where the time/date is coming from; the system clock is set correctly.
We have two machines where only CUDA 11 was installed from the beginning, together with a new installation of CS3. There, everything is running smoothly.
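In case it helps narrow this down, here is a minimal sketch for comparing the system clock with the timestamps in the log. It assumes, without any confirmation from the log itself, that the mismatch could be a local-time versus UTC difference; the variable names are just illustrative.

```shell
# Print the system clock in local time and in UTC, so both can be
# compared against the timestamps shown in the job log.
# Assumption: the log may be written in UTC rather than local time.
local_now=$(date '+%Y-%m-%d %H:%M:%S %Z')
utc_now=$(date -u '+%Y-%m-%d %H:%M:%S %Z')
echo "local: $local_now"
echo "UTC:   $utc_now"
```

If the log timestamps match neither line, the offset is probably coming from somewhere else.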