2D classification hangs in v4.6.0 with no error message

sjcalise · September 23, 2024, 7:05pm

Since updating to CryoSPARC v4.6.0, I’ve twice seen a 2D classification job simply never finish. There’s no error message; it just sits at the start of the final iteration (iteration 80 in this case) and doesn’t move. It has been sitting there for 3 days, so it’s not just an issue of user impatience!

I’ve only seen this happen with 2D classification jobs. The first time it happened, I killed the job and cloned it, and the clone ran successfully. Has anyone else seen this with the new update? I don’t know if it’s something funky with CryoSPARC or our cluster, but the fact that it seems specific to 2D classification makes me guess it’s some type of bug in CryoSPARC.

Edit: I probably should have included this. This is what the end of the job log looks like at the moment:
[CPU: 27.56 GB]

Done Full Iteration 79 took 396.605s for 160000 images
[CPU: 27.53 GB]

Outputting results…
[CPU: 27.54 GB]

Output particles to J40/J40_079_particles.cs
[CPU: 27.54 GB]

Output class averages to J40/J40_079_class_averages.cs, J40/J40_079_class_averages.mrc
[CPU: 27.54 GB]

Clearing previous iteration…
[CPU: 27.54 GB]

Deleting last_output_file_path_abs: /data/kollman/frames2/calise/CS-24sep10a-***/J40/J40_078_particles.cs
[CPU: 27.54 GB]

Deleting last_output_file_path_abs: /data/kollman/frames2/calise/CS-24sep10a-***/J40/J40_078_class_averages.cs
[CPU: 27.54 GB]

Deleting last_output_file_path_abs: /data/kollman/frames2/calise/CS-24sep10a-***/J40/J40_078_class_averages.mrc
[CPU: 27.54 GB]

Removed output results for P835 J40
[CPU: 27.56 GB]

Start of Iteration 80
[CPU: 27.56 GB]

-- DEV 0 THR 0 NUM 683000 TOTAL 4812.6075 ELAPSED 7069.9769 --

hsnyder · September 23, 2024, 8:30pm

Hi @sjcalise,

Some users have reported that stalls like this can happen when transparent huge pages (THP) are enabled. You can check whether transparent huge pages are enabled on your system by running:

cat /sys/kernel/mm/transparent_hugepage/enabled

If the output doesn’t contain [never], including the square brackets around ‘never’, then transparent huge pages are at least partially enabled. If that’s the case, I recommend disabling them. You can do so like this:

echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
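
The check described above can also be scripted. This is a minimal sketch, not part of CryoSPARC: the helper name thp_active_mode is hypothetical, and it assumes the usual sysfs format in which the active mode is the word inside square brackets (for example "always [madvise] never").

```shell
# Hypothetical helper: print the active THP mode, i.e. the bracketed word
# in a line such as "always [madvise] never". Prints nothing if no
# bracketed word is found.
thp_active_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

thp_file=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp_file" ]; then
    mode=$(thp_active_mode "$(cat "$thp_file")")
    if [ "$mode" = "never" ]; then
        echo "THP is disabled"
    else
        echo "THP is at least partially enabled (mode: $mode)"
    fi
else
    echo "THP setting not found at $thp_file"
fi
```

Note that a value written to this sysfs file lasts only until the next reboot, so the check is worth repeating after the machine restarts.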

Please let us know whether this resolves the issue for you; it’s helpful data for us.

–Harris

hsnyder · November 13, 2024, 4:16pm

Hi @sjcalise,

CryoSPARC v4.6.1, released today, configures Python’s numerical library (numpy) to not request huge pages from the operating system. We have found that this change resolves stalls related to transparent huge pages and it is therefore no longer necessary to turn off THP at the system level (leaving the setting at the default “madvise” should no longer cause problems). In v4.6.1, jobs will also emit a warning if the OS is set to “always” enable THP.
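
For readers curious about the numpy side: this post does not spell out the mechanism CryoSPARC uses, but NumPy itself documents an environment variable, NUMPY_MADVISE_HUGEPAGE, that controls whether it asks the kernel for huge pages on large allocations. The sketch below only illustrates that variable for a generic Python process started from a shell; it is an assumption-labeled example, not a CryoSPARC configuration step, and it is not needed on v4.6.1.

```shell
# Assumption: illustrates NumPy's documented NUMPY_MADVISE_HUGEPAGE variable
# for a generic Python process; not taken from CryoSPARC's own configuration.
# A value of 0 tells NumPy not to request huge pages via madvise.
export NUMPY_MADVISE_HUGEPAGE=0

# Any Python process launched from this shell inherits the setting.
echo "NUMPY_MADVISE_HUGEPAGE=$NUMPY_MADVISE_HUGEPAGE"
```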

If you have already changed your OS configuration to disable THP, it is possible (though not necessary) to revert that change after upgrading to v4.6.1.

–Harris

sjcalise · November 13, 2024, 6:38pm

Thanks, Harris. Sorry, I totally forgot to reply about this. Guess it’s a moot point now!