2D Classification: ValueError: index is out of bounds for array

I manually picked 200 particles and tried to run a 2D classification job, but the job fails before even a single iteration completes. All parameters are set to their defaults, other than requesting classification into 10 classes. Below is the output:

Since similar "index is out of bounds" GPU-related errors have been reported in other posts, I will note that the workstation being used is running CentOS 7. Below is the output of nvidia-smi:

[screenshot: nvidia-smi output]

I also checked the output of cryosparcm joblog for the failed job, which is not very informative:

[screenshot: cryosparcm joblog output]

Two interesting observations:

  1. In the same project where this failed job occurred, I ran two other 2D classification jobs using the output of blob picker (containing several hundred thousand particles). Both jobs completed successfully, although some runtime warnings were present in the joblog.

  2. I tried to run this same 2D classification job multiple times, and the job does not always fail before the first iteration. In most cases it fails after iteration 1 with the above error, but in one instance it processed up to iteration 11 before failing with the above error.

Any advice on resolving this issue would be appreciated. Both master and worker are running v3.2.0+210413.

Hi @mchakra,

Thanks for reporting and providing all this information. We’d like to try to reproduce this ourselves on our systems. Do you think you can share the 200 particles (based on the screenshot, they’re scattered across 23 MRC files) and their corresponding .cs file? You can get them by going to the Manual Picker job, navigating to the “Output” tab, and clicking on “Export” under the particles output result group. This will create a new folder inside the project folder with the particle MRC files, the micrograph MRC files, and their corresponding .cs and .csg files. You’ll find the full path at the bottom of the “Overview” tab.
If you’re okay to share your data, let me know, and I’ll send you some details so you can get those over to me.
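As a side note, if you want to sanity-check the export before sending it: the exported .cs file is a NumPy structured array, so it can be inspected directly. Below is a minimal sketch; the filename is a placeholder and the field names shown are typical examples that may differ between CryoSPARC versions.

    import numpy as np

    # Exported .cs files are NumPy structured arrays in .npy format,
    # so numpy.load reads them directly.
    particles = np.load("exported_particles.cs")  # placeholder filename

    print(len(particles), "particles")
    print(particles.dtype.names)  # lists all fields in the export

    # 'blob/path' (present once particles have been extracted) records which
    # MRC stack each particle lives in, e.g. to confirm the 23-file count.
    if "blob/path" in particles.dtype.names:
        print(len(set(particles["blob/path"])), "distinct MRC files")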

Hi @stephan,

Thank you for responding and offering to look into this. I think it should be fine to share the data with you, as long as it is used only for error reproduction purposes. I was able to export the data into the folder as you described. Please let me know how I should send the files to you.

Hi @mchakra,

Definitely, all data you choose to share with us will be kept confidential and used only for the purpose of reproducing this error in order to create a bug fix. I’ll send you a message with credentials for our server, to which you can SCP the files.

I see the same error occasionally on my CentOS system. Just saw it on a Class2D job now, which died in iteration 4. The error in both the GUI log and the joblog look similar:

[CPU: 5.12 GB]   Traceback (most recent call last):
  File "/home/exx/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1791, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1108, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 389, in cryosparc_compute.engine.engine.EngineThread.find_and_set_best_pose_shift
  File "<__array_function__ internals>", line 6, in unravel_index
ValueError: index 1059285798 is out of bounds for array with size 336

If I rerun the same job, even with different parameters (e.g. a smaller mask), it dies at the same iteration:

[CPU: 5.24 GB]   Traceback (most recent call last):
  File "/home/exx/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1791, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1108, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 389, in cryosparc_compute.engine.engine.EngineThread.find_and_set_best_pose_shift
  File "<__array_function__ internals>", line 6, in unravel_index
ValueError: index -1087200183 is out of bounds for array with size 336

Perhaps notably, earlier in the log I see this:

[CPU: 2.66 GB]   Iteration 4
[CPU: 2.66 GB]     -- Effective number of classes per image: min nan | 25-pct nan | median nan | 75-pct nan | max nan 
[CPU: 2.66 GB]     -- Probability of best class per image: min nan | 25-pct nan | median nan | 75-pct nan | max nan 
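For reference, the ValueError itself is trivial to reproduce with NumPy alone; the snippet below only illustrates the failure mode (a garbage flat index handed to np.unravel_index over the size-336 search grid), not where CryoSPARC's bad index actually comes from.

    import numpy as np

    # The size-336 array in the traceback is presumably the per-particle
    # pose/shift candidate grid that find_and_set_best_pose_shift indexes into.
    n_candidates = 336

    # A valid flat index unravels fine.
    print(np.unravel_index(42, (n_candidates,)))

    # A garbage flat index (e.g. read back from NaN or uninitialized GPU
    # memory) reproduces the exact message seen in the job log.
    try:
        np.unravel_index(1059285798, (n_candidates,))
    except ValueError as err:
        print(err)  # index 1059285798 is out of bounds for array with size 336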

We are still seeing this error intermittently - @stephan, can we provide any useful data to help fix this?

Hey @olibclarke,

Are you on the latest patch of v3.2.0? We fixed a bug with a similar cause; I wonder if it will help you here as well.

I’m on the next-to-latest… ok, I will update to the very latest and see if it fixes the issue, thx!

Hey @olibclarke, @mchakra,

Have you experienced this issue while on the latest patch by any chance?

We are not on the latest patch, but currently using v3.2.0+210713, and do not seem to be encountering this issue.

Hi All,
I also encountered the same error during 2D classification (v3.2.0), which didn’t appear before (I am running 2D classification on a dataset that previously worked fine). Any ideas what the cause might be?
Best,
Daniel

Is there an easy fix for this yet? I see a similar error even on a clone of a 2D classification job that previously completed fine. I am using v3.3.2. Thanks!

[CPU: 7.99 GB]   Traceback (most recent call last):
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1811, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1109, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 390, in cryosparc_compute.engine.engine.EngineThread.find_and_set_best_pose_shift
  File "<__array_function__ internals>", line 6, in unravel_index
ValueError: index -1110553443 is out of bounds for array with size 336

@zalkr Did the error occur at the very beginning of the job, or was there any indication of “normal” activity in the Overview tab before the error occurred?
It may be worth investigating whether the job input has been corrupted, either on the cache or on persistent storage.
If you observe the error in a clone of this job even when Cache particle images on SSD is off, you may check for “particle” corruption with the Check For Corrupt Particles job, with Check for NaN values enabled.
For the latest version of that job type, please apply the 220518 patch to your instance.
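For anyone who wants a quick manual spot check outside CryoSPARC, a rough equivalent of the NaN check can be scripted with mrcfile and NumPy; this is only an illustrative sketch, not the built-in job, and the glob pattern is a placeholder for wherever your particle stacks live.

    import glob
    import numpy as np
    import mrcfile

    # Scan particle MRC stacks for NaN/Inf values or unreadable files.
    for path in sorted(glob.glob("exports/particles/*.mrc")):  # placeholder path
        try:
            with mrcfile.mmap(path, mode="r", permissive=True) as mrc:
                n_bad = int(np.count_nonzero(~np.isfinite(mrc.data)))
                if n_bad:
                    print(f"{path}: {n_bad} non-finite values")
        except Exception as exc:
            print(f"{path}: unreadable ({exc})")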

I guess some particles lie at the edges of the micrographs, which likely causes the error. I wonder if CryoSPARC can exclude these particles?
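If edge picks do turn out to be the culprit, they can be filtered out of an exported .cs file with a short script while waiting for a built-in option. This is only a rough sketch: the filenames are placeholders, and the location/... fractional-coordinate fields are assumptions based on typical picker outputs, so check dtype.names on your own export first.

    import numpy as np

    picks = np.load("exported_picks.cs")  # placeholder filename; a structured array
    print(picks.dtype.names)              # confirm which location fields exist

    # Assumed fields: pick centres as fractions of the micrograph width/height.
    x = picks["location/center_x_frac"]
    y = picks["location/center_y_frac"]

    # Keep picks whose extraction box stays inside the micrograph; 'margin' is
    # roughly half the box size expressed as a fraction of the micrograph.
    margin = 0.05
    keep = (x > margin) & (x < 1 - margin) & (y > margin) & (y < 1 - margin)
    print(f"keeping {keep.sum()} of {len(picks)} picks")

    # Write the filtered set; open() avoids numpy appending a .npy extension.
    with open("picks_no_edges.cs", "wb") as f:
        np.save(f, picks[keep])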

@jianhaoc We are investigating the issue.

Hi all,
I am consistently seeing the same error when submitting 2D classification jobs - they fail after 4-25 minutes. Any further insights?
/Anna

Welcome to the forum @ASL.
Please can you post

  • your CryoSPARC version
  • the log lines leading up to the traceback
  • the traceback
  • whether the error occurs when Cache particle images on SSD is disabled
  • whether you have run a Check For Corrupt Particles job, with Check for NaN values enabled.

Hi, I have recently been having this issue myself. I initially thought it was caused by micrographs being assigned the same UID and/or filenames for some reason; however, I have seen that 2D classification jobs can sometimes be restarted and run to completion anyway, and sometimes not.

I’m on v4.4.1+240110, Cache particle images on SSD is enabled, and I have run multiple rounds of Check For Corrupt Particles with and without Check for NaN values, but the jobs completed without finding any corrupt particles.

Traceback (most recent call last):
  File "/...../software/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2192, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 134, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 135, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/jobs/class2D/newrun.py", line 632, in cryosparc_master.cryosparc_compute.jobs.class2D.newrun.class2D_engine_run.work
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 1619, in cryosparc_master.cryosparc_compute.engine.newengine.EngineThread.find_best_pose_shift_class
  File "<__array_function__ internals>", line 5, in unravel_index
ValueError: index -1099971734 is out of bounds for array with size 336

Hope this helps! @wtempel

Thanks for this feedback.

Do you see the same error also when particle caching is disabled?

Yes, the same error also occurs when caching is disabled!