Accumulation of errors in v3

Hi,

We have been running our jobs on v3 for about two weeks, and we have noticed an accumulation of errors. 3D work in particular (ab-initio, refinements, 3D variability) dies at seemingly random time points in a non-reproducible fashion. A few examples:

One ab-initio job:

Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/abinit/run.py", line 304, in cryosparc_compute.jobs.abinit.run.run_homo_abinit
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1119, in cryosparc_compute.engine.engine.process
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1120, in cryosparc_compute.engine.engine.process
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1078, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 389, in cryosparc_compute.engine.engine.EngineThread.find_and_set_best_pose_shift
  File "<__array_function__ internals>", line 6, in unravel_index
ValueError: index 1063733622 is out of bounds for array with size 960
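For readers unfamiliar with this message: `numpy.unravel_index` converts a flat index into coordinates for a given shape, and it raises this `ValueError` when the flat index does not fit in the array. The sketch below only reproduces the message. The shape `(24, 40)` is an assumption chosen so that the size is 960; the real shape is not visible in the traceback, and the sketch says nothing about where the bad index comes from.

```python
import numpy as np

# Hypothetical shape with 24 * 40 = 960 elements; the real shape used
# inside cryoSPARC is not shown in the traceback.
shape = (24, 40)

# A valid flat index maps cleanly to (row, column) coordinates.
coords = tuple(int(i) for i in np.unravel_index(959, shape))
print(coords)  # coordinates of the last element

# The flat index from the traceback cannot belong to a 960-element array,
# so NumPy refuses to convert it.
try:
    np.unravel_index(1063733622, shape)
except ValueError as err:
    message = str(err)
    print(message)
```

An index this far outside the valid range is not an off-by-one error, so whatever produced it was most likely already invalid before `unravel_index` was called.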

Another ab-initio job with the same inputs, which failed at a different iteration:

Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/abinit/run.py", line 222, in cryosparc_compute.jobs.abinit.run.run_homo_abinit
  File "/users/svc_cryosparc/software/regular/cryosparc2_worker/cryosparc_compute/sigproc.py", line 428, in align_density
    assert n.all(n.isfinite(M))
AssertionError

a 3D variability job:

Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/var3D/run.py", line 524, in cryosparc_compute.jobs.var3D.run.run
  File "cryosparc_worker/cryosparc_compute/jobs/var3D/run.py", line 436, in cryosparc_compute.jobs.var3D.run.run.M_step
  File "<__array_function__ internals>", line 6, in eigvals
  File "/users/svc_cryosparc/software/regular/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/numpy/linalg/linalg.py", line 1063, in eigvals
    _assert_finite(a)
  File "/users/svc_cryosparc/software/regular/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/numpy/linalg/linalg.py", line 209, in _assert_finite
    raise LinAlgError("Array must not contain infs or NaNs")
numpy.linalg.LinAlgError: Array must not contain infs or NaNs
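Both the `AssertionError` in `align_density` and this `LinAlgError` say the same thing: an array already contained NaN or inf values when the failing call was made. The following is a minimal sketch of how one might locate such values before calling `numpy.linalg.eigvals`. The helper `describe_nonfinite` and the 3x3 matrix are hypothetical and are not part of cryoSPARC.

```python
import numpy as np

def describe_nonfinite(a):
    """Count NaN and inf entries and report the first offending index."""
    a = np.asarray(a)
    bad = ~np.isfinite(a)
    first = None
    if bad.any():
        # argwhere lists offending indices in row-major order.
        first = tuple(int(i) for i in np.argwhere(bad)[0])
    return {
        "nan": int(np.isnan(a).sum()),
        "inf": int(np.isinf(a).sum()),
        "first_bad_index": first,
    }

# Hypothetical matrix with a single NaN, standing in for the real input.
m = np.eye(3)
m[1, 2] = np.nan

report = describe_nonfinite(m)
print(report)

# eigvals checks its input and raises the same error as in the traceback.
try:
    np.linalg.eigvals(m)
except np.linalg.LinAlgError as err:
    eig_error = str(err)
    print(eig_error)
```

A check like this only shows that bad values are present and where; it does not explain how they got there.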

Another ab-initio run, in a different project:

Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/abinit/run.py", line 304, in cryosparc_compute.jobs.abinit.run.run_homo_abinit
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1119, in cryosparc_compute.engine.engine.process
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1120, in cryosparc_compute.engine.engine.process
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1078, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 389, in cryosparc_compute.engine.engine.EngineThread.find_and_set_best_pose_shift
  File "<__array_function__ internals>", line 6, in unravel_index
ValueError: index -1085655912 is out of bounds for array with size 8064

Since the update, these problems have been piling up across several very different datasets, and ab-initio jobs seem to be especially vulnerable. Even old jobs that previously ran to completion now fail with this issue when I simply clone and rerun them.

Any idea what the reason could be?

Best,

David

Hey @david.haselbach,

Can you report the output of the following command?
lscpu && free -g && uname -a && nvidia-smi

Hi @david.haselbach,
We are investigating these errors. Specifically for the errors in ab-initio, can you provide the streamlog of the job from just before the error? It contains some diagnostic values that will help us, as we haven't been able to reproduce the problem ourselves.

This is just above the error:

[CPU: 1.01 GB]   ----------- Iteration   534 (epoch 2.467).  radwn 32.22  resolution 20.49A  minisize  300  beta 0.00 

[CPU: 982.1 MB]     -- Class  0 -- lr: 0.20 eps:  0.13 step ratio : 0.1434 ESS R:  1.973 S:  1.874 Class Size: 100.0% (Average: 100.0%)

[CPU: 982.1 MB]    Done iteration 00534 of 01403 in  1.710s. Total time  690.3s. Est time remaining 1528.8s.

[CPU: 1021.9 MB] ----------- Iteration   535 (epoch 2.475).  radwn 32.26  resolution 20.46A  minisize  300  beta 0.00 

[CPU: 1.01 GB]      -- Class  0 -- lr: 0.20 eps:  0.13 step ratio : 0.1424 ESS R:  1.944 S:  1.903 Class Size: 100.0% (Average: 100.0%)

[CPU: 1.01 GB]     Done iteration 00535 of 01403 in  1.695s. Total time  692.0s. Est time remaining 1521.3s.

[CPU: 1.01 GB]   ----------- Iteration   536 (epoch 2.483).  radwn 32.30  resolution 20.44A  minisize  300  beta 0.00 

[CPU: 989.7 MB]     -- Class  0 -- lr: 0.20 eps:  0.13 step ratio : 0.1446 ESS R:  1.987 S:  1.992 Class Size: 100.0% (Average: 100.0%)

[CPU: 989.7 MB]    Done iteration 00536 of 01403 in  1.803s. Total time  693.8s. Est time remaining 1523.8s.

[CPU: 997.7 MB]  ----------- Iteration   537 (epoch 2.490).  radwn 32.34  resolution 20.41A  minisize  300  beta 0.00 

[CPU: 1005.6 MB]    -- Class  0 -- lr: 0.20 eps:  0.13 step ratio : 0.1435 ESS R:  2.053 S:  2.020 Class Size: 100.0% (Average: 100.0%)

[CPU: 1005.6 MB]   Done iteration 00537 of 01403 in  1.691s. Total time  695.5s. Est time remaining 1516.1s.

[CPU: 1.01 GB]   ----------- Iteration   538 (epoch 2.498).  radwn 32.38  resolution 20.38A  minisize  300  beta 0.00
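To compare the per-iteration diagnostics across several failed jobs, the "Class" lines of a streamlog like the one above can be pulled into numbers. This is a minimal sketch that assumes the line layout is exactly as pasted here; the names `CLASS_LINE`, `parse_class_lines` and `nonfinite_rows` are hypothetical, and the pattern may need adjusting for other job types or versions.

```python
import math
import re

# Matches lines such as:
#   -- Class  0 -- lr: 0.20 eps:  0.13 step ratio : 0.1434 ESS R:  1.973 S:  1.874 ...
# \S+ is used for the values so that "nan" or "inf" would also be captured.
CLASS_LINE = re.compile(
    r"--\s*Class\s+(?P<cls>\d+)\s*--\s*"
    r"lr:\s*(?P<lr>\S+)\s+eps:\s*(?P<eps>\S+)\s+"
    r"step ratio\s*:\s*(?P<step_ratio>\S+)\s+"
    r"ESS R:\s*(?P<ess_r>\S+)\s+S:\s*(?P<ess_s>\S+)"
)

def parse_class_lines(log_text):
    """Return one dict of diagnostic values per 'Class' line in the log."""
    rows = []
    for m in CLASS_LINE.finditer(log_text):
        row = {"class": int(m.group("cls"))}
        for key in ("lr", "eps", "step_ratio", "ess_r", "ess_s"):
            row[key] = float(m.group(key))
        rows.append(row)
    return rows

def nonfinite_rows(rows):
    """Return the rows in which any value is NaN or infinite."""
    return [r for r in rows if not all(math.isfinite(v) for v in r.values())]

# Two lines copied from the log above.
sample = """
[CPU: 982.1 MB]     -- Class  0 -- lr: 0.20 eps:  0.13 step ratio : 0.1434 ESS R:  1.973 S:  1.874 Class Size: 100.0% (Average: 100.0%)
[CPU: 1.01 GB]      -- Class  0 -- lr: 0.20 eps:  0.13 step ratio : 0.1424 ESS R:  1.944 S:  1.903 Class Size: 100.0% (Average: 100.0%)
"""

rows = parse_class_lines(sample)
for row in rows:
    print(row)
print("rows with non-finite values:", nonfinite_rows(rows))
```

In the excerpt above all values are finite and change only slightly from one iteration to the next, so the log alone gives no advance warning of the crash.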

Thanks. Are you also able to send your system info: lscpu && free -g && uname -a && nvidia-smi
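If the same information has to be gathered from several worker nodes, the four commands can also be run from a small script. This is a minimal sketch using only the Python standard library; the function name `collect` and the placeholder messages are hypothetical. Unlike the `&&` chain, which stops at the first command that fails, it keeps going and records which commands were missing.

```python
import shutil
import subprocess

# The four commands requested in the thread, as argument lists.
COMMANDS = [
    ["lscpu"],
    ["free", "-g"],
    ["uname", "-a"],
    ["nvidia-smi"],
]

def collect(commands=COMMANDS, timeout=30):
    """Run each command and return {command string: output or a note}."""
    results = {}
    for argv in commands:
        label = " ".join(argv)
        # Skip commands that are not installed instead of aborting.
        if shutil.which(argv[0]) is None:
            results[label] = "(command not found on this machine)"
            continue
        try:
            proc = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
        except (OSError, subprocess.TimeoutExpired) as err:
            results[label] = f"(failed to run: {err})"
            continue
        results[label] = proc.stdout + proc.stderr
    return results

if __name__ == "__main__":
    for label, output in collect().items():
        print(f"$ {label}\n{output}")
```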