Accumulation of errors in v3

Hi,

We have been running our jobs on v3 for about two weeks, and we have noticed an accumulation of errors. 3D work in particular (ab-initio, refinements, 3D variability) dies at seemingly random time points in a non-reproducible fashion. A few examples:

One ab-initio job:

Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/abinit/run.py", line 304, in cryosparc_compute.jobs.abinit.run.run_homo_abinit
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1119, in cryosparc_compute.engine.engine.process
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1120, in cryosparc_compute.engine.engine.process
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1078, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 389, in cryosparc_compute.engine.engine.EngineThread.find_and_set_best_pose_shift
  File "<__array_function__ internals>", line 6, in unravel_index
ValueError: index 1063733622 is out of bounds for array with size 960

An ab-initio job with the same inputs, but the failure happened at a different iteration:
Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/abinit/run.py", line 222, in cryosparc_compute.jobs.abinit.run.run_homo_abinit
  File "/users/svc_cryosparc/software/regular/cryosparc2_worker/cryosparc_compute/sigproc.py", line 428, in align_density
    assert n.all(n.isfinite(M))
AssertionError

a 3D variability job:

Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/var3D/run.py", line 524, in cryosparc_compute.jobs.var3D.run.run
  File "cryosparc_worker/cryosparc_compute/jobs/var3D/run.py", line 436, in cryosparc_compute.jobs.var3D.run.run.M_step
  File "<__array_function__ internals>", line 6, in eigvals
  File "/users/svc_cryosparc/software/regular/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/numpy/linalg/linalg.py", line 1063, in eigvals
    _assert_finite(a)
  File "/users/svc_cryosparc/software/regular/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/numpy/linalg/linalg.py", line 209, in _assert_finite
    raise LinAlgError("Array must not contain infs or NaNs")
numpy.linalg.LinAlgError: Array must not contain infs or NaNs
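
For context on this last message: `numpy.linalg.eigvals` refuses any input that contains NaN or infinity, so the matrix was already non-finite before this call was reached. A minimal sketch that reproduces the same exception on a toy matrix (the matrix `a` is made up for illustration and has nothing to do with the actual job data):

```python
import numpy as np

# Toy 2x2 matrix with one NaN entry (illustrative only, not job data).
a = np.array([[1.0, 2.0], [np.nan, 4.0]])

# Cheap pre-check that tells you whether eigvals would reject the input.
finite = bool(np.isfinite(a).all())

try:
    np.linalg.eigvals(a)
    raised = False
except np.linalg.LinAlgError as exc:
    # Same exception type as in the traceback above.
    raised = True
    print(exc)
```

A check like `np.isfinite(a).all()` only detects the problem; it says nothing about where the NaN was first introduced.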

Another ab-initio run, in a different project:

Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/abinit/run.py", line 304, in cryosparc_compute.jobs.abinit.run.run_homo_abinit
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1119, in cryosparc_compute.engine.engine.process
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1120, in cryosparc_compute.engine.engine.process
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1078, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 389, in cryosparc_compute.engine.engine.EngineThread.find_and_set_best_pose_shift
  File "<__array_function__ internals>", line 6, in unravel_index
ValueError: index -1085655912 is out of bounds for array with size 8064

Since the update, these problems have been accumulating across multiple, very different datasets, and ab-initio jobs seem to be especially vulnerable. Old jobs that previously ran to completion now hit the same issue when I simply clone and rerun them.

Any idea what the reason could be?

Best,

David


Hey @david.haselbach,

Can you report the following:
lscpu && free -g && uname -a && nvidia-smi

Hi @david.haselbach,
We are investigating these errors. For the ab-initio errors specifically, can you provide the stream log of the job from just before the error? It contains some diagnostic values that will help us, since we haven't been able to reproduce the problem ourselves.


This is just above the error:

[CPU: 1.01 GB]   ----------- Iteration   534 (epoch 2.467).  radwn 32.22  resolution 20.49A  minisize  300  beta 0.00 

[CPU: 982.1 MB]     -- Class  0 -- lr: 0.20 eps:  0.13 step ratio : 0.1434 ESS R:  1.973 S:  1.874 Class Size: 100.0% (Average: 100.0%)

[CPU: 982.1 MB]    Done iteration 00534 of 01403 in  1.710s. Total time  690.3s. Est time remaining 1528.8s.

[CPU: 1021.9 MB] ----------- Iteration   535 (epoch 2.475).  radwn 32.26  resolution 20.46A  minisize  300  beta 0.00 

[CPU: 1.01 GB]      -- Class  0 -- lr: 0.20 eps:  0.13 step ratio : 0.1424 ESS R:  1.944 S:  1.903 Class Size: 100.0% (Average: 100.0%)

[CPU: 1.01 GB]     Done iteration 00535 of 01403 in  1.695s. Total time  692.0s. Est time remaining 1521.3s.

[CPU: 1.01 GB]   ----------- Iteration   536 (epoch 2.483).  radwn 32.30  resolution 20.44A  minisize  300  beta 0.00 

[CPU: 989.7 MB]     -- Class  0 -- lr: 0.20 eps:  0.13 step ratio : 0.1446 ESS R:  1.987 S:  1.992 Class Size: 100.0% (Average: 100.0%)

[CPU: 989.7 MB]    Done iteration 00536 of 01403 in  1.803s. Total time  693.8s. Est time remaining 1523.8s.

[CPU: 997.7 MB]  ----------- Iteration   537 (epoch 2.490).  radwn 32.34  resolution 20.41A  minisize  300  beta 0.00 

[CPU: 1005.6 MB]    -- Class  0 -- lr: 0.20 eps:  0.13 step ratio : 0.1435 ESS R:  2.053 S:  2.020 Class Size: 100.0% (Average: 100.0%)

[CPU: 1005.6 MB]   Done iteration 00537 of 01403 in  1.691s. Total time  695.5s. Est time remaining 1516.1s.

[CPU: 1.01 GB]   ----------- Iteration   538 (epoch 2.498).  radwn 32.38  resolution 20.38A  minisize  300  beta 0.00

Thanks, are you able to send your system info: lscpu && free -g && uname -a && nvidia-smi

I observe the same error in v3.2 ab initio:

Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/abinit/run.py", line 222, in cryosparc_compute.jobs.abinit.run.run_homo_abinit
  File "/home/hiter/cryosparc/cryosparc_worker/cryosparc_compute/sigproc.py", line 453, in align_density
    assert n.all(n.isfinite(M))
AssertionError

And then the other error reported above for NU-refinement (New) on the same stack:

Traceback (most recent call last):
  File "/home/hiter/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1790, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1108, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 389, in cryosparc_compute.engine.engine.EngineThread.find_and_set_best_pose_shift
  File "<__array_function__ internals>", line 6, in unravel_index
ValueError: index 1041670348 is out of bounds for array with size 5376
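
For readers unfamiliar with this `ValueError`: NumPy raises it whenever the flat index passed to `np.unravel_index` falls outside `[0, size)`. Here is a minimal sketch that reproduces the message using the numbers from the traceback above. The 1-D shape `(5376,)` is an assumption (only the total size is known from the message), and `safe_unravel` is a hypothetical helper, not part of cryoSPARC:

```python
import numpy as np

def safe_unravel(flat_index, shape):
    """Hypothetical helper: validate a flat index before unraveling it.

    Returns None instead of raising when the index is out of range.
    """
    size = int(np.prod(shape))
    if not 0 <= flat_index < size:
        return None
    return np.unravel_index(flat_index, shape)

# Reproduce the error with the values reported in the traceback.
# The shape (5376,) is assumed; only the total size appears in the message.
try:
    np.unravel_index(1041670348, (5376,))
    message = ""
except ValueError as exc:
    message = str(exc)
    print(message)
```

An index this far outside the array cannot come from a valid search over 5376 candidates, so the index itself was already corrupted before this call. The sketch only shows how the exception arises, not where the bad value originates.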

Hi @navid-paknejad, we have several reports of users hitting errors like this on the CentOS 7 operating system. Are you on CentOS 7?
As far as we can tell, there is some incompatibility between CentOS 7 and CUDA, but we don't yet have a way to work around it.

Can you also provide the output of nvidia-smi so we can see your CUDA and driver versions?

I can confirm that I observe the same ab-initio error as navid-paknejad and david.haselbach above, using the latest patched version (v3.2.0+210511).

In the output, the values usually diverge to NaN before this error occurs.

[CPU: 798.2 MB]  ----------- Iteration    60 (epoch 0.196).  radwn 10.00  resolution 18.70A  minisize   90  beta 0.10 

[CPU: 758.2 MB]     -- Class  0 -- lr: 0.40 eps:  2.69 step ratio : 0.3609 ESS R:  1.000 S:  1.000 Class Size: 100.0% (Average: 100.0%)

[CPU: 758.2 MB]    Done iteration 00060 of 01585 in  0.539s. Total time   33.6s.

[CPU: 798.3 MB]  ----------- Iteration    61 (epoch 0.200).  radwn 10.00  resolution 18.70A  minisize   90  beta 0.10 

[CPU: 814.2 MB]     -- Class  0 -- lr: 0.40 eps:  2.56 step ratio : 0.2798 ESS R:  1.000 S:  1.000 Class Size: 100.0% (Average: 100.0%)

[CPU: 814.2 MB]    Done iteration 00061 of 01585 in  0.575s. Total time   34.2s.

[CPU: 814.2 MB]  ----------- Iteration    62 (epoch 0.203).  radwn 10.00  resolution 18.70A  minisize   90  beta 0.10 

[CPU: 774.2 MB]     -- Class  0 -- lr: 0.40 eps:  2.71 step ratio : 0.3187 ESS R:  1.000 S:  1.000 Class Size: 100.0% (Average: 100.0%)

[CPU: 774.2 MB]    Done iteration 00062 of 01585 in  0.540s. Total time   34.7s.

[CPU: 790.1 MB]  ----------- Iteration    63 (epoch 0.206).  radwn 10.00  resolution 18.70A  minisize   90  beta 0.10 

[CPU: 806.3 MB]     -- Class  0 -- lr: 0.40 eps:   inf step ratio :   nan ESS R:    nan S:    nan Class Size: nan% (Average: nan%)

[CPU: 806.3 MB]    Done iteration 00063 of 01585 in  0.460s. Total time   35.2s.

[CPU: 806.3 MB]  ----------- Iteration    64 (epoch 0.209).  radwn 10.00  resolution 18.70A  minisize   90  beta 0.10 

[CPU: 758.3 MB]     -- Class  0 -- lr: 0.40 eps:   inf step ratio :   nan ESS R:    nan S:    nan Class Size: nan% (Average: nan%)

[CPU: 758.3 MB]    Done iteration 00064 of 01585 in  0.472s. Total time   35.7s.

[CPU: 814.1 MB]  ----------- Iteration    98 (epoch 0.319).  radwn 10.00  resolution 18.70A  minisize   90  beta 0.10 

[CPU: 782.1 MB]     -- Class  0 -- lr: 0.40 eps:   inf step ratio :   nan ESS R:    nan S:    nan Class Size: nan% (Average: nan%)

[CPU: 782.1 MB]    Done iteration 00098 of 01585 in  0.433s. Total time   50.4s.

[CPU: 790.1 MB]  ----------- Iteration    99 (epoch 0.322).  radwn 10.00  resolution 18.70A  minisize   90  beta 0.10 

[CPU: 774.3 MB]     -- Class  0 -- lr: 0.40 eps:   inf step ratio :   nan ESS R:    nan S:    nan Class Size: nan% (Average: nan%)

[CPU: 774.3 MB]    Done iteration 00099 of 01585 in  0.427s. Total time   50.9s.

[CPU: 782.4 MB]  Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/abinit/run.py", line 222, in cryosparc_compute.jobs.abinit.run.run_homo_abinit
  File "/home/cryosparc/local/cryosparc_worker/cryosparc_compute/sigproc.py", line 453, in align_density
    assert n.all(n.isfinite(M))
AssertionError
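
The point at which a log like the one above diverges can be found automatically. The following is a rough sketch, assuming the stream log has been saved as plain text in the format shown in this thread. The function name `first_nonfinite_iteration` and the regular expressions are my own and are not part of cryoSPARC:

```python
import re

# Matches the per-iteration header lines, e.g. "Iteration    63 (epoch 0.206)".
ITER_RE = re.compile(r"Iteration\s+(\d+)\s+\(epoch")
# Matches a standalone "nan" or "inf" token in a per-class statistics line.
NONFINITE_RE = re.compile(r"\b(?:nan|inf)\b", re.IGNORECASE)

def first_nonfinite_iteration(log_text):
    """Return the first iteration whose class statistics contain nan/inf.

    Returns None if every class line in the log is finite.
    """
    current = None
    for line in log_text.splitlines():
        match = ITER_RE.search(line)
        if match:
            current = int(match.group(1))
            continue
        if "-- Class" in line and NONFINITE_RE.search(line):
            return current
    return None

# Two iterations copied from the log above.
sample = """\
[CPU: 814.2 MB]  ----------- Iteration    62 (epoch 0.203).  radwn 10.00  resolution 18.70A  minisize   90  beta 0.10
[CPU: 774.2 MB]     -- Class  0 -- lr: 0.40 eps:  2.71 step ratio : 0.3187 ESS R:  1.000 S:  1.000 Class Size: 100.0% (Average: 100.0%)
[CPU: 774.2 MB]    Done iteration 00062 of 01585 in  0.540s. Total time   34.7s.
[CPU: 790.1 MB]  ----------- Iteration    63 (epoch 0.206).  radwn 10.00  resolution 18.70A  minisize   90  beta 0.10
[CPU: 806.3 MB]     -- Class  0 -- lr: 0.40 eps:   inf step ratio :   nan ESS R:    nan S:    nan Class Size: nan% (Average: nan%)
[CPU: 806.3 MB]    Done iteration 00063 of 01585 in  0.460s. Total time   35.2s.
"""

first_bad = first_nonfinite_iteration(sample)
print(first_bad)
```

Comparing this iteration number across reruns of the same job would show whether the divergence always starts at the same point or at a different one each time.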

I am using CentOS 7, and the output of nvidia-smi is below.