Kernel panic - Linux box with 2 GPUs

closed

#1

I recently installed cryoSPARC on a Linux box with 2 GPUs for Dr. Mohammad Mazhab-Jafari. It was purchased about a year ago, and Dr. Jeff Lee has the same setup and has not experienced issues, although he said they haven’t run cryoSPARC that often on that computer.

We are having a hardware issue when we run cryoSPARC. We haven’t had issues before when running Relion, even when really pushing the performance with both GPUs at the same time.

To sum up, when running certain jobs that use the GPU, like Refinement, the job stops with a heartbeat error. Sometimes cryoSPARC stays running (as in, cryosparcm stays running), but very often the computer reboots. Out of 50 refinement jobs, only one completed, so the failure is not completely deterministic. I don’t see much of anything when monitoring htop or nvidia-smi over ssh; the ssh pipe just hangs/breaks when the computer reboots. However, if we log in in text mode, the following error spontaneously appears on its own (see attached).

hardware error: CPU 5: machine check exception
Kernel panic - not syncing: Fatal machine check
Rebooting in 30 seconds
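A machine check exception points at the hardware itself rather than at any user-space program, so decoding the MCE record is a useful first step. A minimal sketch, assuming Ubuntu’s `rasdaemon` package is available (the `journalctl` pattern is just one way to pull the raw records out of the previous boot’s kernel log):

```shell
# Install and start the RAS daemon, which decodes machine check records
# as they arrive (package name on Ubuntu 18.04 is "rasdaemon"):
sudo apt-get install -y rasdaemon
sudo systemctl enable --now rasdaemon

# After the next crash, list the decoded errors:
sudo ras-mc-ctl --errors

# The raw records also land in the kernel log of the boot that crashed:
journalctl -k -b -1 | grep -i "machine check"
```

The decoded record usually names the bank and error type (cache, bus, internal), which narrows down whether it is CPU, RAM, or power related.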

The screenshot says:
Driver Version: 440.33.01
CUDA Version: 10.2

We changed to the following and are still having the same problem:
Driver Version: 410.48
CUDA Version: 10.0.130

Does anyone have ideas for how to narrow down the problem? Should I run more cryoSPARC jobs and see which ones complete and which fail?

What can I do to diagnose the issue? Should I try combinations of Drivers and CUDA versions?
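If it comes to testing driver/CUDA combinations, it helps to record exactly which pair each run used. Something like this (the nvcc path assumes a default install under /usr/local; adjust as needed):

```shell
# Loaded kernel driver version (works even without a CUDA toolkit):
cat /proc/driver/nvidia/version

# Per-GPU driver version as the runtime sees it:
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader

# Toolkit version the worker was compiled against:
/usr/local/cuda/bin/nvcc --version | grep release
```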


#2

Hey @Geoffrey!

What are the specs of the computer (OS, CPU, RAM, storage)?


#3

Ubuntu 18.04.2 LTS
64 GB RAM
24 CPUs
916 GB SSD (mounted at /, with cryoSPARC scratch at /cryosparc2_scratch/)
Projects stored on 4 TB disk (/run/media/owner/Data1)

Verbose details below from:

lsb_release -a
cat /proc/meminfo
cat /proc/cpuinfo
htop
df -h
owner@owner-System-Product-Name:~$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.2 LTS
Release:	18.04
Codename:	bionic

owner@owner-System-Product-Name:~$ cat /proc/meminfo
MemTotal:       65632888 kB
MemFree:        42605584 kB
MemAvailable:   64103204 kB
Buffers:          234780 kB
Cached:         21408976 kB
SwapCached:            0 kB
Active:           840528 kB
Inactive:       21193860 kB
Active(anon):     246884 kB
Inactive(anon):   146212 kB
Active(file):     593644 kB
Inactive(file): 21047648 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       2097148 kB
SwapFree:        2097148 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        390780 kB
Mapped:           181100 kB
Shmem:              2436 kB
Slab:             668332 kB
SReclaimable:     588772 kB
SUnreclaim:        79560 kB
KernelStack:        9152 kB
PageTables:        19012 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    34913592 kB
Committed_AS:    2401556 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      638620 kB
DirectMap2M:    59840512 kB
DirectMap1G:     6291456 kB

owner@owner-System-Product-Name:~$ cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz
stepping	: 4
microcode	: 0x2000064
cpu MHz		: 1200.039
cache size	: 16896 KB
physical id	: 0
siblings	: 24
core id		: 0
cpu cores	: 12
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req md_clear flush_l1d
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips	: 5800.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz
stepping	: 4
microcode	: 0x2000064
cpu MHz		: 1200.011
cache size	: 16896 KB
physical id	: 0
siblings	: 24
core id		: 1
cpu cores	: 12
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req md_clear flush_l1d
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips	: 5800.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

# TRUNCATED

processor	: 23
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz
stepping	: 4
microcode	: 0x2000064
cpu MHz		: 1200.228
cache size	: 16896 KB
physical id	: 0
siblings	: 24
core id		: 13
cpu cores	: 12
apicid		: 27
initial apicid	: 27
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req md_clear flush_l1d
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips	: 5800.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

owner@owner-System-Product-Name:~$ head -c -10 /tmp/htop.out  | tail -c +10

  1  [               0.0%]   7  [               0.0%]   13 [               0.0%]   19 [               0.0%]
  2  [               0.0%]   8  [               0.0%]   14 [               0.0%]   20 [               0.0%]
  3  [               0.0%]   9  [               0.0%]   15 [               0.0%]   21 [               0.0%]
  4  [               0.0%]   10 [               0.0%]   16 [               0.0%]   22 [               0.0%]
  5  [|||||||||||||100.0%]   11 [               0.0%]   17 [               0.0%]   23 [               0.0%]
  6  [               0.0%]   12 [               0.0%]   18 [               0.0%]   24 [               0.0%]
  Mem[||||||||||||||||||                  779M/62.6G]   Tasks: 83, 188 thr; 1 running
  Swp[                                      0K/2.00G]   Load average: 0.00 0.00 0.00
                                                        Uptime: 15:45:01

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
 9735 owner      20   0 40872  5028  3944 R 57.1  0.0  0:00.05 htop

owner@owner-System-Product-Name:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             32G     0   32G   0% /dev
tmpfs           6.3G  2.3M  6.3G   1% /run
/dev/sdb2       916G  792G   78G  92% /
tmpfs            32G     0   32G   0% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/loop0       55M   55M     0 100% /snap/core18/1668
/dev/loop2       15M   15M     0 100% /snap/gnome-characters/375
/dev/loop1       55M   55M     0 100% /snap/core18/1650
/dev/loop3      4.3M  4.3M     0 100% /snap/gnome-calculator/536
/dev/loop4      161M  161M     0 100% /snap/gnome-3-28-1804/116
/dev/sdb1       511M  6.1M  505M   2% /boot/efi
/dev/loop5      141M  141M     0 100% /snap/gnome-3-26-1604/97
/dev/loop7      3.8M  3.8M     0 100% /snap/gnome-system-monitor/123
/dev/loop6       15M   15M     0 100% /snap/gnome-characters/399
/dev/loop8      157M  157M     0 100% /snap/gnome-3-28-1804/110
/dev/loop9       90M   90M     0 100% /snap/core/8268
/dev/loop10     141M  141M     0 100% /snap/gnome-3-26-1604/98
/dev/loop11      45M   45M     0 100% /snap/gtk-common-themes/1440
/dev/loop12     1.0M  1.0M     0 100% /snap/gnome-logs/81
/dev/loop13      45M   45M     0 100% /snap/gtk-common-themes/1353
/dev/loop14     3.8M  3.8M     0 100% /snap/gnome-system-monitor/127
/dev/loop15     1.0M  1.0M     0 100% /snap/gnome-logs/73
/dev/loop16     4.3M  4.3M     0 100% /snap/gnome-calculator/544
tmpfs           6.3G   16K  6.3G   1% /run/user/121
/dev/sda1       3.6T  2.1T  1.4T  60% /run/media/owner/Data1
/dev/loop18      92M   92M     0 100% /snap/core/8592
tmpfs           6.3G     0  6.3G   0% /run/user/1000


#4

Hey @Geoffrey,

Unfortunately we’ve never seen anything like this (i.e. a kernel panic) before, and generally speaking I don’t know of any way our code (e.g. Python) could actually cause a kernel panic without some other system or hardware issue. It’s strange that the issue only happens when cryoSPARC is running, for sure, but there are still really no clues about the cause… sorry we can’t be more helpful!


#5

It turns out that the sister computer (identical spec, in Jeff Lee’s lab) did have similar problems with the computer shutting off. Right now they have things working with:

  • CUDA 10.2
  • NVIDIA 440.31
  • cryoSPARC 2.13.2

I did some testing of jobs:

Driver Version: 410.48
CUDA Version: 10.0.130
cryoSPARC: v2.12.4

I could run Import Movies (72 movies). Then, with 1 movie, I could run
Full-frame motion, CTF Estimation, Patch motion, Patch CTF, Blob picker, Extract micrographs (GPU and CPU), and Ab-Initio.
I ran two Ab-Initio jobs at the same time (one with SSD caching, one without), and the one without SSD caching failed at iteration 200 with ====== Job process terminated abnormally.

A refinement failed during -- Iteration 0:

-- DEV 0 THR 0 NUM 103 TOTAL 0.1339330 ELAPSED 1.1286489 --

  Processed 205.000 images in 1.858s.

  Computing FSCs... 

Job is unresponsive - no heartbeat received in 30 seconds.

The job log of the ab initio that worked:

owner@owner-System-Product-Name:/run/media/owner/gw/P28$ more J33/job.log


================= CRYOSPARCW =======  2020-02-14 08:40:40.158212  =========
Project P3 Job J33
Master owner-System-Product-Name Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 8689
========= monitor process now waiting for main process
MAIN PID 8689
abinit.run cryosparc2_compute.jobs.jobregister
***************************************************************
Running job  J33  of type  homo_abinit
Running job on hostname %s owner-System-Product-Name
Allocated Resources :  {u'lane': u'default', u'target': {u'monitor_port': None, u'lane': u'default', u'name': u
'owner-System-Product-Name', u'title': u'Worker node owner-System-Product-Name', u'resource_slots': {u'GPU': [0
, 1], u'RAM': [0, 1, 2, 3, 4, 5, 6, 7], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23]}, u'hostname': u'owner-System-Product-Name', u'worker_bin_path': u'/home/owner/cryospar
c2a/cryosparc2_worker/bin/cryosparcw', u'cache_path': u'/home/owner/cryosparc2a/', u'cache_quota_mb': None, u'r
esource_fixed': {u'SSD': True}, u'cache_reserve_mb': 10000, u'type': u'node', u'ssh_str': u'owner@owner-System-
Product-Name', u'desc': None}, u'license': True, u'hostname': u'owner-System-Product-Name', u'slots': {u'GPU':
[0], u'RAM': [0], u'CPU': [0, 1]}, u'fixed': {u'SSD': True}, u'lane_type': u'default', u'licenses_acquired': 1}
cryosparc2_compute/jobs/runcommon.py:1490: RuntimeWarning: divide by zero encountered in float_scalars
  run_old(*args, **kw)
cryosparc2_compute/jobs/runcommon.py:1490: RuntimeWarning: divide by zero encountered in double_scalars
  run_old(*args, **kw)
cryosparc2_compute/jobs/runcommon.py:1490: RuntimeWarning: invalid value encountered in float_scalars
  run_old(*args, **kw)
cryosparc2_compute/jobs/runcommon.py:1490: RuntimeWarning: invalid value encountered in double_scalars
  run_old(*args, **kw)
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
/home/owner/cryosparc2a/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/matplotlib/pyplot.py:516: R
untimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib
.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning
, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
***************************************************************
========= main process now complete.
========= monitor process now complete.

The one that didn’t work:

owner@owner-System-Product-Name:/run/media/owner/gw/P28$ more J34/job.log


================= CRYOSPARCW =======  2020-02-14 08:41:11.079270  =========
Project P3 Job J34
Master owner-System-Product-Name Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 8876
========= monitor process now waiting for main process
MAIN PID 8876
abinit.run cryosparc2_compute.jobs.jobregister
***************************************************************
Running job  J34  of type  homo_abinit
Running job on hostname %s owner-System-Product-Name
Allocated Resources :  {u'lane': u'default', u'target': {u'monitor_port': None, u'lane': u'default', u'name': u
'owner-System-Product-Name', u'title': u'Worker node owner-System-Product-Name', u'resource_slots': {u'GPU': [0
, 1], u'RAM': [0, 1, 2, 3, 4, 5, 6, 7], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23]}, u'hostname': u'owner-System-Product-Name', u'worker_bin_path': u'/home/owner/cryospar
c2a/cryosparc2_worker/bin/cryosparcw', u'cache_path': u'/home/owner/cryosparc2a/', u'cache_quota_mb': None, u'r
esource_fixed': {u'SSD': True}, u'cache_reserve_mb': 10000, u'type': u'node', u'ssh_str': u'owner@owner-System-
Product-Name', u'desc': None}, u'license': True, u'hostname': u'owner-System-Product-Name', u'slots': {u'GPU':
[1], u'RAM': [1], u'CPU': [2, 3]}, u'fixed': {u'SSD': True}, u'lane_type': u'default', u'licenses_acquired': 1}
cryosparc2_compute/jobs/runcommon.py:1490: RuntimeWarning: invalid value encountered in float_scalars
  run_old(*args, **kw)
cryosparc2_compute/jobs/runcommon.py:1490: RuntimeWarning: invalid value encountered in double_scalars
  run_old(*args, **kw)
cryosparc2_compute/jobs/runcommon.py:1490: RuntimeWarning: divide by zero encountered in float_scalars
  run_old(*args, **kw)
cryosparc2_compute/jobs/runcommon.py:1490: RuntimeWarning: divide by zero encountered in double_scalars
  run_old(*args, **kw)
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
malloc(): memory corruption
malloc(): memory corruption
========= main process now complete.
========= monitor process now complete.

Then I updated the NVIDIA driver to the most recent one (440.59), keeping CUDA 10.0 and cryoSPARC v2.12.4.

Full-frame motion (1 movie) completed, but I could not get Ab-Initio to run. The job logs didn’t show anything; it was as if the computer restarted while everything was working fine:

owner@owner-System-Product-Name:/run/media/owner/gw/P28$ more J45/job.log


================= CRYOSPARCW =======  2020-02-14 10:44:34.059865  =========
Project P3 Job J45
Master owner-System-Product-Name Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 3422
========= monitor process now waiting for main process
MAIN PID 3422
abinit.run cryosparc2_compute.jobs.jobregister
***************************************************************
Running job  J45  of type  homo_abinit
Running job on hostname %s owner-System-Product-Name
Allocated Resources :  {u'lane': u'default', u'target': {u'monitor_port': None, u'lane': u'default', u'name': u
'owner-System-Product-Name', u'title': u'Worker node owner-System-Product-Name', u'resource_slots': {u'GPU': [0
, 1], u'RAM': [0, 1, 2, 3, 4, 5, 6, 7], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23]}, u'hostname': u'owner-System-Product-Name', u'worker_bin_path': u'/home/owner/cryospar
c2a/cryosparc2_worker/bin/cryosparcw', u'cache_path': u'/home/owner/cryosparc2a/', u'cache_quota_mb': None, u'r
esource_fixed': {u'SSD': True}, u'cache_reserve_mb': 10000, u'type': u'node', u'ssh_str': u'owner@owner-System-
Product-Name', u'desc': None}, u'license': True, u'hostname': u'owner-System-Product-Name', u'slots': {u'GPU':
[0], u'RAM': [0], u'CPU': [0, 1]}, u'fixed': {u'SSD': True}, u'lane_type': u'default', u'licenses_acquired': 1}
cryosparc2_compute/jobs/runcommon.py:1490: RuntimeWarning: divide by zero encountered in float_scalars
  run_old(*args, **kw)
cryosparc2_compute/jobs/runcommon.py:1490: RuntimeWarning: divide by zero encountered in double_scalars
  run_old(*args, **kw)
cryosparc2_compute/jobs/runcommon.py:1490: RuntimeWarning: invalid value encountered in float_scalars
  run_old(*args, **kw)
cryosparc2_compute/jobs/runcommon.py:1490: RuntimeWarning: invalid value encountered in double_scalars
  run_old(*args, **kw)
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat

Then I updated to CUDA 10.2, so now the driver is the most recent (440.59), but cryoSPARC is still the older v2.12.4.

This didn’t change anything compared to CUDA 10.0. I could get Full-frame motion to complete, but not Ab-Initio; the computer rebooted in iteration 0. I don’t think there was even time to update the job.log, since it seems truncated at:

owner@owner-System-Product-Name:/run/media/owner/gw/P28$ more J47/job.log


================= CRYOSPARCW =======  2020-02-14 11:11:24.754086  =========
Project P3 Job J47
Master owner-System-Product-Name Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 6524
========= monitor process now waiting for main process
MAIN PID 6524
abinit.run cryosparc2_compute.jobs.jobregister

I then updated cryoSPARC to v2.13.2. So now:
CUDA 10.2
NVIDIA driver 440.59
cryoSPARC v2.13.2

I can complete Full-frame motion, Patch CTF, Blob picker, and Extract micrographs (a light load of just 1 exposure). However, I’m getting some informative errors in Ab-Initio, 2D Classification, and Refinement:

Ab initio

owner@owner-System-Product-Name:/run/media/owner/gw/P28$ more J49/job.log


================= CRYOSPARCW =======  2020-02-14 11:28:15.311179  =========
Project P3 Job J49
Master owner-System-Product-Name Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 3329
========= monitor process now waiting for main process
MAIN PID 3329
abinit.run cryosparc2_compute.jobs.jobregister
/home/owner/cryosparc2a/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py:284: UserW
arning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
***************************************************************
Running job  J49  of type  homo_abinit
Running job on hostname %s owner-System-Product-Name
Allocated Resources :  {u'lane': u'default', u'target': {u'monitor_port': None, u'lane': u'default', u'name': u
'owner-System-Product-Name', u'title': u'Worker node owner-System-Product-Name', u'resource_slots': {u'GPU': [0
, 1], u'RAM': [0, 1, 2, 3, 4, 5, 6, 7], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23]}, u'hostname': u'owner-System-Product-Name', u'worker_bin_path': u'/home/owner/cryospar
c2a/cryosparc2_worker/bin/cryosparcw', u'cache_path': u'/home/owner/cryosparc2a/', u'cache_quota_mb': None, u'r
esource_fixed': {u'SSD': True}, u'gpus': [{u'mem': 11554717696, u'id': 0, u'name': u'GeForce RTX 2080 Ti'}, {u'
mem': 11551440896, u'id': 1, u'name': u'GeForce RTX 2080 Ti'}], u'cache_reserve_mb': 10000, u'type': u'node', u
'ssh_str': u'owner@owner-System-Product-Name', u'desc': None}, u'license': True, u'hostname': u'owner-System-Pr
oduct-Name', u'slots': {u'GPU': [0], u'RAM': [0], u'CPU': [0, 1]}, u'fixed': {u'SSD': True}, u'lane_type': u'de
fault', u'licenses_acquired': 1}
**custom thread exception hook caught something
**** handle exception rc
set status to failed
Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 1547, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 110, in cryosparc2_compute.engine.cuda_
core.GPUThread.run
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 111, in cryosparc2_compute.engine.cuda_
core.GPUThread.run
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 991, in cryosparc2_compute.engine.engine.p
rocess.work
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 101, in cryosparc2_compute.engine.engine.E
ngineThread.load_image_data_gpu
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_kernels.py", line 1803, in cryosparc2_compute.engine.c
uda_kernels.prepare_real
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 362, in cryosparc2_compute.engine.cuda_
core.context_dependent_memoize.wrapper
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_kernels.py", line 1707, in cryosparc2_compute.engine.c
uda_kernels.get_util_kernels
  File "/home/owner/cryosparc2a/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/pycuda/compiler.py"
, line 294, in __init__
    self.module = module_from_buffer(cubin)
pycuda._driver.LogicError: cuModuleLoadDataEx failed: device kernel image is invalid - error   : Binary format
for key='0', ident='' is not recognized
========= main process now complete.
========= monitor process now complete.

2D Class

owner@owner-System-Product-Name:/run/media/owner/gw/P28$ more J54/job.log


================= CRYOSPARCW =======  2020-02-14 11:32:04.829898  =========
Project P3 Job J54
Master owner-System-Product-Name Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 3902
========= monitor process now waiting for main process
MAIN PID 3902
class2D.run cryosparc2_compute.jobs.jobregister
/home/owner/cryosparc2a/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py:284: UserW
arning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
***************************************************************
Running job  J54  of type  class_2D
Running job on hostname %s owner-System-Product-Name
Allocated Resources :  {u'lane': u'default', u'target': {u'monitor_port': None, u'lane': u'default', u'name': u
'owner-System-Product-Name', u'title': u'Worker node owner-System-Product-Name', u'resource_slots': {u'GPU': [0
, 1], u'RAM': [0, 1, 2, 3, 4, 5, 6, 7], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23]}, u'hostname': u'owner-System-Product-Name', u'worker_bin_path': u'/home/owner/cryospar
c2a/cryosparc2_worker/bin/cryosparcw', u'cache_path': u'/home/owner/cryosparc2a/', u'cache_quota_mb': None, u'r
esource_fixed': {u'SSD': True}, u'gpus': [{u'mem': 11554717696, u'id': 0, u'name': u'GeForce RTX 2080 Ti'}, {u'
mem': 11551440896, u'id': 1, u'name': u'GeForce RTX 2080 Ti'}], u'cache_reserve_mb': 10000, u'type': u'node', u
'ssh_str': u'owner@owner-System-Product-Name', u'desc': None}, u'license': True, u'hostname': u'owner-System-Pr
oduct-Name', u'slots': {u'GPU': [0], u'RAM': [0, 1, 2], u'CPU': [0, 1]}, u'fixed': {u'SSD': True}, u'lane_type'
: u'default', u'licenses_acquired': 1}
**custom thread exception hook caught something
**** handle exception rc
set status to failed
Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 1547, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 110, in cryosparc2_compute.engine.cuda_
core.GPUThread.run
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 111, in cryosparc2_compute.engine.cuda_
core.GPUThread.run
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 991, in cryosparc2_compute.engine.engine.p
rocess.work
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 101, in cryosparc2_compute.engine.engine.E
ngineThread.load_image_data_gpu
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_kernels.py", line 1803, in cryosparc2_compute.engine.c
uda_kernels.prepare_real
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 362, in cryosparc2_compute.engine.cuda_
core.context_dependent_memoize.wrapper
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_kernels.py", line 1707, in cryosparc2_compute.engine.c
uda_kernels.get_util_kernels
  File "/home/owner/cryosparc2a/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/pycuda/compiler.py"
, line 294, in __init__
    self.module = module_from_buffer(cubin)
pycuda._driver.LogicError: cuModuleLoadDataEx failed: device kernel image is invalid - error   : Binary format
for key='0', ident='' is not recognized
========= main process now complete.
========= monitor process now complete.

Refinement

owner@owner-System-Product-Name:/run/media/owner/gw/P28$ more J55/job.log


================= CRYOSPARCW =======  2020-02-14 11:56:22.342156  =========
Project P3 Job J55
Master owner-System-Product-Name Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 4110
========= monitor process now waiting for main process
MAIN PID 4110
refine.run cryosparc2_compute.jobs.jobregister
/home/owner/cryosparc2a/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cublas.py:284: UserW
arning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
========= sending heartbeat
***************************************************************
Running job  J55  of type  homo_refine
Running job on hostname %s owner-System-Product-Name
Allocated Resources :  {u'lane': u'default', u'target': {u'monitor_port': None, u'lane': u'default', u'name': u
'owner-System-Product-Name', u'title': u'Worker node owner-System-Product-Name', u'resource_slots': {u'GPU': [0
, 1], u'RAM': [0, 1, 2, 3, 4, 5, 6, 7], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23]}, u'hostname': u'owner-System-Product-Name', u'worker_bin_path': u'/home/owner/cryospar
c2a/cryosparc2_worker/bin/cryosparcw', u'cache_path': u'/home/owner/cryosparc2a/', u'cache_quota_mb': None, u'r
esource_fixed': {u'SSD': True}, u'gpus': [{u'mem': 11554717696, u'id': 0, u'name': u'GeForce RTX 2080 Ti'}, {u'
mem': 11551440896, u'id': 1, u'name': u'GeForce RTX 2080 Ti'}], u'cache_reserve_mb': 10000, u'type': u'node', u
'ssh_str': u'owner@owner-System-Product-Name', u'desc': None}, u'license': True, u'hostname': u'owner-System-Pr
oduct-Name', u'slots': {u'GPU': [0], u'RAM': [0, 1, 2], u'CPU': [0, 1, 2, 3]}, u'fixed': {u'SSD': True}, u'lane
_type': u'default', u'licenses_acquired': 1}
**custom thread exception hook caught something
**** handle exception rc
set status to failed
Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 1547, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 110, in cryosparc2_compute.engine.cuda_
core.GPUThread.run
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 111, in cryosparc2_compute.engine.cuda_
core.GPUThread.run
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 991, in cryosparc2_compute.engine.engine.p
rocess.work
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 101, in cryosparc2_compute.engine.engine.E
ngineThread.load_image_data_gpu
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_kernels.py", line 1803, in cryosparc2_compute.engine.c
uda_kernels.prepare_real
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 362, in cryosparc2_compute.engine.cuda_
core.context_dependent_memoize.wrapper
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_kernels.py", line 1707, in cryosparc2_compute.engine.c
uda_kernels.get_util_kernels
  File "/home/owner/cryosparc2a/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/pycuda/compiler.py"
, line 294, in __init__
    self.module = module_from_buffer(cubin)
pycuda._driver.LogicError: cuModuleLoadDataEx failed: device kernel image is invalid - error   : Binary format
for key='0', ident='' is not recognized
========= main process now complete.
========= monitor process now complete.

#6

During this, I installed pycuda manually, following “Cannot recompile with new cuda after a previous ‘cryosparcw newcuda’ run failed”.
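For reference, the usual way to rebuild the worker’s CUDA extensions (including pycuda) after a toolkit change is `cryosparcw newcuda`; a sketch, where the toolkit path is an assumption for this machine (the worker path is the one from the job logs above):

```shell
# Hypothetical CUDA location -- adjust to where the toolkit actually lives:
CUDA_PATH=/usr/local/cuda-10.2

# Sanity-check the path before handing it to the worker:
if [ -x "$CUDA_PATH/bin/nvcc" ]; then
    /home/owner/cryosparc2a/cryosparc2_worker/bin/cryosparcw newcuda "$CUDA_PATH"
else
    echo "nvcc not found under $CUDA_PATH" >&2
fi
```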

For next steps, I was thinking of reinstalling cryoSPARC from scratch.


#7

Hi Geoffrey,
we experienced some problems with excessive RAM usage during refinements in the previous cryoSPARC version. Not as drastic as yours: just a missed heartbeat and a crash of the cryoSPARC instance. It might be worth checking your RAM during the run to see if it points you in that direction. The most recent cryoSPARC should improve RAM usage a lot.
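Since htop only gives a live view that dies with the ssh session, one way to check this is to log memory to a file that survives the reboot; a minimal sketch:

```shell
# Append a timestamped memory reading once a minute; after a crash, the
# last lines show whether RAM was exhausted just before the reboot:
while true; do
    printf '%s %s MiB used\n' "$(date -Is)" \
        "$(free -m | awk '/^Mem:/ {print $3}')" >> "$HOME/mem.log"
    sleep 60
done
```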

Fingers crossed,
Dan


#8

I monitored RAM with htop and it wasn’t heavily used.


#9

What CPU and motherboard is it? We had the same MCE/kernel-panic reboots; after much diagnosis they turned out to be due to a compatibility problem between Skylake-X processors and X299 chipsets that manifested when we ran cryoSPARC but was not caught by the vendor stress tests or by other programs, e.g. Relion. Ours was an i9-7920X on an ASUS WS X299 SAGE/10G. Implementing the fix in the link below (change the AVX and AVX-512 offsets to -4 and -7 in the BIOS) has fixed the issue for us. We reported this back to the vendor and they are following up on it. Hope that helps.

https://software.intel.com/en-us/forums/intel-c-compiler/topic/779705


#10

Thanks for the specific advice!

Our CPU is an Intel Core i9-7920X.
Our motherboard is an ASUS ROG RAMPAGE VI EXTREME.
I’m going to change the AVX and AVX-512 offsets and see if that solves the issue.
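Before rerunning full refinements, one quick way to check the change is to hammer the vector units directly for a few minutes and see whether the MCE reboot still occurs. A sketch assuming `stress-ng` is installed (its matrix stressor is a convenient AVX-heavy load, though it is not guaranteed to hit the AVX-512 paths exactly the way cryoSPARC does):

```shell
# Run one matrix stressor per CPU (0 = all CPUs) for five minutes:
stress-ng --matrix 0 --matrix-size 512 -t 300s --metrics-brief
```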


#11

It looks like changing those offsets in the BIOS solved the problem.

We ran two simultaneous refinement jobs and they finished OK.

Thanks again @liamworrall :slight_smile:


#12

Great, sorry I didn’t see this earlier. Although ours was a vendor-assembled workstation, they weren’t much help once all the obvious potential problems had been ruled out, since we installed cryoSPARC ourselves and the issue only seemed to present when we ran it. Finding out after months of troubleshooting that it was a hardware issue was a frustrating experience, so I’m glad this could help someone!


#13

Thanks all for solving this! Glad things are working :slight_smile: