Ab-initio error

Hi, I’m getting this error when I try to run ab-initio on an NS data set:
Traceback (most recent call last):
File “cryosparc_master/cryosparc_compute/run.py”, line 115, in cryosparc_master.cryosparc_compute.run.main
File “cryosparc_master/cryosparc_compute/jobs/abinit/run.py”, line 316, in cryosparc_master.cryosparc_compute.jobs.abinit.run.run_homo_abinit
File “cryosparc_master/cryosparc_compute/engine/engine.py”, line 1194, in cryosparc_master.cryosparc_compute.engine.engine.process
File “cryosparc_master/cryosparc_compute/engine/engine.py”, line 1195, in cryosparc_master.cryosparc_compute.engine.engine.process
File “cryosparc_master/cryosparc_compute/engine/engine.py”, line 1136, in cryosparc_master.cryosparc_compute.engine.engine.process.work
File “cryosparc_master/cryosparc_compute/engine/engine.py”, line 421, in cryosparc_master.cryosparc_compute.engine.engine.EngineThread.compute_error
ValueError: Detected NaN values in engine.compute_error. 10445760 NaNs in total, 90 particles with NaNs.

I ran the check for corrupt particles job on the same particle set and no corruption was found.
The 2D classes look OK.

Best,

@Elad What version of CryoSPARC do you use?

I’m using CryoSPARC 4.5.0

Thanks @Elad. Please can you post the output of the command

cryosparcm cli "get_job('P99', 'J199', 'job_type', 'version', 'instance_information', 'status',  'params_spec', 'errors_run')"

where P99 and J199 are replaced by the failed job’s project and job IDs, respectively.

{‘_id’: ‘66fd90e261e050ac966f3069’, ‘errors_run’: [{‘message’: ‘Detected NaN values in engine.compute_error. 10445760 NaNs in total, 90 particles with NaNs.’, ‘warning’: False}], ‘instance_information’: {‘CUDA_version’: ‘11.8’, ‘available_memory’: ‘493.12GB’, ‘cpu_model’: ‘Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz’, ‘driver_version’: ‘12.4’, ‘gpu_info’: [{‘id’: 0, ‘mem’: 11539054592, ‘name’: ‘NVIDIA GeForce RTX 2080 Ti’, ‘pcie’: ‘0000:02:00’}, {‘id’: 1, ‘mem’: 11539054592, ‘name’: ‘NVIDIA GeForce RTX 2080 Ti’, ‘pcie’: ‘0000:03:00’}, {‘id’: 2, ‘mem’: 11539054592, ‘name’: ‘NVIDIA GeForce RTX 2080 Ti’, ‘pcie’: ‘0000:83:00’}, {‘id’: 3, ‘mem’: 11539054592, ‘name’: ‘NVIDIA GeForce RTX 2080 Ti’, ‘pcie’: ‘0000:84:00’}], ‘ofd_hard_limit’: 4096, ‘ofd_soft_limit’: 1024, ‘physical_cores’: 36, ‘platform_architecture’: ‘x86_64’, ‘platform_node’: ‘mamba.csb.vanderbilt.edu’, ‘platform_release’: ‘3.10.0-1160.76.1.el7.x86_64’, ‘platform_version’: ‘#1 SMP Wed Aug 10 16:21:17 UTC 2022’, ‘total_memory’: ‘503.79GB’, ‘used_memory’: ‘9.20GB’}, ‘job_type’: ‘homo_abinit’, ‘params_spec’: {}, ‘project_uid’: ‘P31’, ‘status’: ‘failed’, ‘uid’: ‘J103’, ‘version’: ‘v4.5.0’}

Thanks @Elad for posting this information.
Does ValueError: Detected NaN values in engine.compute_error occur

  • every time you run a clone of the failed job?
  • when inputting particles from a different extraction, downsampling or restacking job in the same project?
  • in other projects?

If possible, can you also test whether the error occurs after downgrading the nvidia driver on the CryoSPARC worker to version 525 and subsequently restarting the CryoSPARC worker?

  • Yes, it failed every time I tried.
  • A different extraction, from micrographs without the NS flag, solved it.
  • It works in other projects.

I have the same error when running ab-initio.

Traceback (most recent call last):
File “cryosparc_master/cryosparc_compute/run.py”, line 115, in cryosparc_master.cryosparc_compute.run.main
File “cryosparc_master/cryosparc_compute/jobs/abinit/run.py”, line 329, in cryosparc_master.cryosparc_compute.jobs.abinit.run.run_homo_abinit
File “cryosparc_master/cryosparc_compute/engine/engine.py”, line 1194, in cryosparc_master.cryosparc_compute.engine.engine.process
File “cryosparc_master/cryosparc_compute/engine/engine.py”, line 1195, in cryosparc_master.cryosparc_compute.engine.engine.process
File “cryosparc_master/cryosparc_compute/engine/engine.py”, line 1136, in cryosparc_master.cryosparc_compute.engine.engine.process.work
File “cryosparc_master/cryosparc_compute/engine/engine.py”, line 421, in cryosparc_master.cryosparc_compute.engine.engine.EngineThread.compute_error
ValueError: Detected NaN values in engine.compute_error. 52228800 NaNs in total, 90 particles with NaNs.

Thanks,
Wendy

Hi @Wendy,

Have you tried running the check for corrupt particles job? If you turn on the parameter “Check for NaN values”, it will report whether or not any particles on disk contain NaNs. I recommend running that job, just to see if there are in fact NaN values on-disk. The job doesn’t fail if there are NaN values, it just emits a warning, which can be easy to overlook. There’s also a summary printed near the end of the job’s stream log which will list all files that contained NaNs. The job will output a subset of the input particles that is chosen to avoid all the files that contain NaNs, which may allow you to continue processing the data by using the cleaned particle subset.
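For anyone who wants to double-check a particle stack outside of CryoSPARC, here is a minimal sketch of the kind of per-particle NaN count described above. It is not CryoSPARC's implementation. The function name `summarize_nans` and the demo array are made up for illustration, and loading a real stack from disk (for example with a third-party MRC reader) is left out.

```python
import numpy as np

def summarize_nans(stack):
    """Return (total NaN count, number of particles containing any NaN).

    `stack` is assumed to be an array of shape (n_particles, height, width).
    """
    nan_mask = np.isnan(stack)
    # Flatten each particle image and count its NaN pixels
    per_particle = nan_mask.reshape(stack.shape[0], -1).sum(axis=1)
    return int(per_particle.sum()), int(np.count_nonzero(per_particle))

# Hypothetical demo data: 4 tiny "particles", one of which has 2 NaN pixels
demo = np.zeros((4, 8, 8), dtype=np.float32)
demo[2, 0, :2] = np.nan

total_nans, bad_particles = summarize_nans(demo)
print(f"{total_nans} NaNs in total, {bad_particles} particles with NaNs.")
```

A clean result from a check like this only says that the pixel values on disk are finite. It does not rule out NaNs that appear later, during computation.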

–Harris

Hi @Wendy,

Just following up regarding my previous post. Did you try the check for corrupt particles job, and if so, what was the outcome? Your input would be valuable to us as we try to diagnose the cause of these NaN value errors.

–Harris

I have tried it and it found no corrupt particles.
It happened again in a different project.
In both cases it is an NS data set where the blob picker did not pick the particles well [it picked mostly non-particles], so I used the exp tool to switch to non-NS. After that it picks well but gives this error.

Hi Harris,

I tried the check for corrupt particles job. No corruption was detected (see attached picture).

I can resolve this issue by reselecting the 2D classes and using the reselected classes to run an ab initio job. This approach often solves the problem.

Thanks,
Wendy

Hi,
I had the same error when running ab-initio with 4 different datasets.

In all 4 datasets, I restarted the job several times and every run failed.
In all 4 datasets, I ran the check for corrupt particles job and no corruption was detected.
In 1 dataset, I tried inputting particles from a different extraction with downsampling (Fourier crop from 2592 pix to 256 pix) and got the same error.
I also tried reselecting the 2D classes as Wendy posted, but the same error occurred.
Ab-initio worked in other datasets.
The CryoSPARC version is v4.7.0.
Thank you very much in advance!
Cheers,
Nolan

Welcome to the forum @nolanwang.
Please can you post as text the outputs of these commands:

csprojectid="P99" # replace with actual project id
check_id="J199" # job ID of the relevant upstream check particles job
abinit_id="J200" # job ID of a failed ab initio job
cryosparcm cli "get_job('$csprojectid', '$check_id', 'type', 'version', 'params_spec', 'status')"
cryosparcm cli "get_job('$csprojectid', '$abinit_id', 'type', 'version', 'params_spec', 'status')"
cryosparcm eventlog $csprojectid $abinit_id | tail -n 50

Hi,
Thanks for the welcome!
I have pasted the output of the commands you posted below.
I also reran the job (1) without SSD cache and (2) with max res changed from 12 (default) to 20, as this post suggested (Ab initio errors), but neither of them worked.
Here is the output of the commands:

> [cryosparc-user@app02 ~]$ cryosparcm cli "get_job('$csprojectid', '$check_id', 'type', 'version', 'params_spec', 'status')"
> {'_id': '685647109606e7e37e15ebc6', 'params_spec': {}, 'project_uid': 'P618', 'status': 'completed', 'type': 'check_corrupt_particles', 'uid': 'J80', 'version': 'v4.7.0'}
> [cryosparc-user@app02 ~]$ cryosparcm cli "get_job('$csprojectid', '$abinit_id', 'type', 'version', 'params_spec', 'status')"
> {'_id': '685a2d719606e7e37e4f7572', 'params_spec': {'abinit_K': {'value': 3}}, 'project_uid': 'P618', 'status': 'failed', 'type': 'homo_abinit', 'uid': 'J83', 'version': 'v4.7.0'

Here are the last 50 lines of the failed job's event log:

> [CPU:   2.94 GB]
>   Done iteration 00049 of 02484 in 11.249s. Total time  637.3s.
> 
> [CPU:   3.07 GB]
> ----------- Iteration    50 (epoch 0.187).  radwn 50.87  resolution 35.00A  minisize   90  beta 0.10 
> 
> [CPU:   2.94 GB]
>    -- Class  0 -- lr: 0.40 eps:  9.52 step ratio : 2.4760 ESS R:  1.000 S:  1.000 Class Size: 57.5% (Average: 34.8%)
> 
> [CPU:   2.94 GB]
>    -- Class  1 -- lr: 0.40 eps:  9.52 step ratio : 1.1889 ESS R:  1.000 S:  1.000 Class Size: 36.8% (Average: 32.3%)
> 
> [CPU:   2.94 GB]
>    -- Class  2 -- lr: 0.40 eps:  9.52 step ratio : 0.1720 ESS R:  1.000 S:  1.000 Class Size: 4.0% (Average: 32.8%)
> 
> [CPU:   3.00 GB]
>   Done iteration 00050 of 02484 in 19.660s. Total time  657.0s.
> 
> [CPU:   3.13 GB]
> ----------- Iteration    51 (epoch 0.191).  radwn 50.87  resolution 35.00A  minisize   90  beta 0.10 
> 
> [CPU:   2.94 GB]
>    -- Class  0 -- lr: 0.40 eps:  7.12 step ratio : 2.6161 ESS R:  1.000 S:  1.000 Class Size: 82.2% (Average: 35.5%)
> 
> [CPU:   2.94 GB]
>    -- Class  1 -- lr: 0.40 eps:  7.12 step ratio : 0.4670 ESS R:  1.000 S:  1.000 Class Size: 13.3% (Average: 32.0%)
> 
> [CPU:   2.94 GB]
>    -- Class  2 -- lr: 0.40 eps:  7.12 step ratio : 0.9429 ESS R:  1.000 S:  1.000 Class Size: 36.8% (Average: 32.8%)
> 
> [CPU:   2.94 GB]
>   Done iteration 00051 of 02484 in 11.361s. Total time  668.3s.
> 
> [CPU:   3.07 GB]
> ----------- Iteration    52 (epoch 0.194).  radwn 50.87  resolution 35.00A  minisize   90  beta 0.10 
> 
> [CPU:   2.94 GB]
>    -- Class  0 -- lr: 0.40 eps:  2.15 step ratio : 0.8511 ESS R:  1.000 S:  1.000 Class Size: 90.0% (Average: 36.3%)
> 
> [CPU:   2.94 GB]
>    -- Class  1 -- lr: 0.40 eps:  2.15 step ratio : 0.1882 ESS R:  1.000 S:  1.000 Class Size: 14.6% (Average: 31.7%)
> 
> [CPU:   2.94 GB]
>    -- Class  2 -- lr: 0.40 eps:  2.15 step ratio : 3.8379 ESS R:  1.000 S:  1.000 Class Size: 286.4% (Average: 36.6%)
> 
> [CPU:   2.94 GB]
>   Done iteration 00052 of 02484 in 11.342s. Total time  679.7s.
> 
> [CPU:   3.07 GB]
> ----------- Iteration    53 (epoch 0.198).  radwn 50.87  resolution 35.00A  minisize   90  beta 0.10 
> 
> [CPU:   2.94 GB]
>    -- Class  0 -- lr: 0.40 eps:  3.68 step ratio : 1.1598 ESS R:  1.000 S:  1.000 Class Size: 91.1% (Average: 37.1%)
> 
> [CPU:   2.94 GB]
>    -- Class  1 -- lr: 0.40 eps:  3.68 step ratio : 0.6845 ESS R:  1.000 S:  1.000 Class Size: 36.8% (Average: 31.8%)
> 
> [CPU:   2.94 GB]
>    -- Class  2 -- lr: 0.40 eps:  3.68 step ratio : 2.3170 ESS R:  1.000 S:  1.000 Class Size: 114.8% (Average: 37.7%)
> 
> [CPU:   2.94 GB]
>   Done iteration 00053 of 02484 in 11.369s. Total time  691.1s.
> 
> [CPU:   3.07 GB]
> ----------- Iteration    54 (epoch 0.202).  radwn 50.87  resolution 35.00A  minisize   90  beta 0.10 
> 
> [CPU:   2.94 GB]
>    -- Class  0 -- lr: 0.40 eps:  0.00 step ratio : 4.9239 ESS R:  1.000 S:  1.000 Class Size: 9873537.8% (Average: 142480.6%)
> 
> [CPU:   2.94 GB]
>    -- Class  1 -- lr: 0.40 eps:  0.00 step ratio : 0.0075 ESS R:  1.000 S:  1.000 Class Size: 481.5% (Average: 38.3%)
> 
> [CPU:   2.94 GB]
>    -- Class  2 -- lr: 0.40 eps:  0.00 step ratio : 0.0176 ESS R:  1.000 S:  1.000 Class Size: 3378.9% (Average: 85.9%)
> 
> [CPU:   2.94 GB]
>   Done iteration 00054 of 02484 in 11.264s. Total time  702.3s.
> 
> [CPU:   3.07 GB]
> ----------- Iteration    55 (epoch 0.205).  radwn 50.87  resolution 35.00A  minisize   90  beta 0.10 
> 
> [CPU:   2.94 GB]
>    -- Class  0 -- lr: 0.40 eps:  0.03 step ratio : 0.0215 ESS R:  1.000 S:  1.000 Class Size: 98.9% (Average: 140453.7%)
> 
> [CPU:   2.94 GB]
>    -- Class  1 -- lr: 0.40 eps:  0.03 step ratio : 0.1078 ESS R:  1.000 S:  1.000 Class Size: 489.3% (Average: 44.7%)
> 
> [CPU:   2.94 GB]
>    -- Class  2 -- lr: 0.40 eps:  0.03 step ratio : 3.9793 ESS R:  1.000 S:  1.000 Class Size: 26556.3% (Average: 462.8%)
> 
> [CPU:   2.94 GB]
>   Done iteration 00055 of 02484 in 11.721s. Total time  714.0s.
> 
> [CPU:   3.07 GB]
> ----------- Iteration    56 (epoch 0.209).  radwn 50.87  resolution 35.00A  minisize   90  beta 0.10 
> 
> [CPU:   2.94 GB]
>    -- Class  0 -- lr: 0.40 eps:  4.02 step ratio : 1.8779 ESS R:  1.000 S:  1.000 Class Size: 100.0% (Average: 138481.8%)
> 
> [CPU:   2.94 GB]
>    -- Class  1 -- lr: 0.40 eps:  4.02 step ratio : 0.6134 ESS R:  1.000 S:  1.000 Class Size: 36.8% (Average: 44.6%)
> 
> [CPU:   2.94 GB]
>    -- Class  2 -- lr: 0.40 eps:  4.02 step ratio : 1.5176 ESS R:  1.000 S:  1.000 Class Size: 95.6% (Average: 457.6%)
> 
> [CPU:   2.94 GB]
>   Done iteration 00056 of 02484 in 11.189s. Total time  725.2s.
> 
> [CPU:   3.07 GB]
> ----------- Iteration    57 (epoch 0.213).  radwn 50.87  resolution 35.00A  minisize   90  beta 0.10 
> 
> [CPU:   2.94 GB]
>    -- Class  0 -- lr: 0.40 eps:   nan step ratio :   nan ESS R:  1.000 S:  1.000 Class Size: inf% (Average: inf%)
> 
> [CPU:   2.94 GB]
>    -- Class  1 -- lr: 0.40 eps:   nan step ratio :   nan ESS R:  1.000 S:  1.000 Class Size: 270.8% (Average: 47.7%)
> 
> [CPU:   2.94 GB]
>    -- Class  2 -- lr: 0.40 eps:   nan step ratio :   nan ESS R:  0.992 S:  0.994 Class Size: 87736618844160.0% (Average: 1216798458190.0%)
> 
> [CPU:   2.94 GB]
>   Done iteration 00057 of 02484 in 11.705s. Total time  736.9s.
> 
> [CPU:   3.07 GB]
> ----------- Iteration    58 (epoch 0.216).  radwn 50.87  resolution 35.00A  minisize   90  beta 0.10 
> 
> [CPU:  957.7 MB]
> Traceback (most recent call last):
>   File "cryosparc_master/cryosparc_compute/run.py", line 129, in cryosparc_master.cryosparc_compute.run.main
>   File "cryosparc_master/cryosparc_compute/jobs/abinit/run.py", line 330, in cryosparc_master.cryosparc_compute.jobs.abinit.run.run_homo_abinit
>   File "cryosparc_master/cryosparc_compute/engine/engine.py", line 1204, in cryosparc_master.cryosparc_compute.engine.engine.process
>   File "cryosparc_master/cryosparc_compute/engine/engine.py", line 1205, in cryosparc_master.cryosparc_compute.engine.engine.process
>   File "cryosparc_master/cryosparc_compute/engine/engine.py", line 1146, in cryosparc_master.cryosparc_compute.engine.engine.process.work
>   File "cryosparc_master/cryosparc_compute/engine/engine.py", line 431, in cryosparc_master.cryosparc_compute.engine.engine.EngineThread.compute_error
> ValueError: Detected NaN values in engine.compute_error. 31337280 NaNs in total, 90 particles with NaNs.

Thanks a lot for the quick response!
Best wishes,
Nolan
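In the event log above, the class sizes go well past 100% a few iterations before the first nan appears. A small script along these lines could help find the first such iteration in a saved copy of the log. This is only a sketch: the regular expressions are written against the line format pasted above, the function name `first_suspicious_iteration` is made up, and treating a class size above 100% as suspicious is an assumption, not a documented CryoSPARC criterion.

```python
import math
import re

# Patterns written against the event log lines quoted in this thread
ITER_RE = re.compile(r"Iteration\s+(\d+)\s+\(epoch")
CLASS_RE = re.compile(r"Class\s+(\d+)\s+--.*?eps:\s*(\S+).*?Class Size:\s*(\S+?)%")

def first_suspicious_iteration(lines, max_class_pct=100.0):
    """Return (iteration, class index) of the first class line whose eps or
    class size is non-finite, or whose class size exceeds max_class_pct.
    Return None if no such line is found."""
    current = None
    for line in lines:
        m = ITER_RE.search(line)
        if m:
            current = int(m.group(1))
            continue
        m = CLASS_RE.search(line)
        if m and current is not None:
            eps = float(m.group(2))    # float() also parses "nan" and "inf"
            size = float(m.group(3))
            if not (math.isfinite(eps) and math.isfinite(size)) or size > max_class_pct:
                return current, int(m.group(1))
    return None

# A few lines copied from the log above
sample = [
    "----------- Iteration    51 (epoch 0.191).  radwn 50.87  resolution 35.00A  minisize   90  beta 0.10",
    "   -- Class  0 -- lr: 0.40 eps:  7.12 step ratio : 2.6161 ESS R:  1.000 S:  1.000 Class Size: 82.2% (Average: 35.5%)",
    "  Done iteration 00051 of 02484 in 11.361s. Total time  668.3s.",
    "----------- Iteration    52 (epoch 0.194).  radwn 50.87  resolution 35.00A  minisize   90  beta 0.10",
    "   -- Class  0 -- lr: 0.40 eps:  2.15 step ratio : 0.8511 ESS R:  1.000 S:  1.000 Class Size: 90.0% (Average: 36.3%)",
    "   -- Class  2 -- lr: 0.40 eps:  2.15 step ratio : 3.8379 ESS R:  1.000 S:  1.000 Class Size: 286.4% (Average: 36.6%)",
]
print(first_suspicious_iteration(sample))
```

To run it on a real job, the output of cryosparcm eventlog could be saved to a text file and its lines passed to the function.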