Error with large box size refinement

I tried running a NU-refinement with a large box size (1024 px) and it fails with the error below (on an RTX 3090 card). Thoughts? I am running in low-memory mode. It works when the particles are downsampled to 768 px, so I assume it is a memory error of some kind, but it isn't totally obvious to me from looking at the error.

Cheers
Oli

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 115, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/refine/newrun.py", line 372, in cryosparc_master.cryosparc_compute.jobs.refine.newrun.run_homo_refine
  File "/home/exx/cryosparc/cryosparc_worker/cryosparc_compute/newfourier.py", line 417, in resample_resize_real
    return ZT( ifft( ZT(fft(x, stack=stack), N_resample, stack=stack), stack=stack), M, stack=stack), psize_final
  File "/home/exx/cryosparc/cryosparc_worker/cryosparc_compute/newfourier.py", line 122, in ifft
    return ifftcenter3(X, fft_threads)
  File "/home/exx/cryosparc/cryosparc_worker/cryosparc_compute/newfourier.py", line 95, in ifftcenter3
    v = fftmod.irfftn(tmp, threads=th)
  File "/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pyfftw/interfaces/numpy_fft.py", line 293, in irfftn
    return _Xfftn(a, s, axes, overwrite_input, planner_effort,
  File "/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pyfftw/interfaces/_utils.py", line 128, in _Xfftn
    FFTW_object = getattr(builders, calling_func)(*planner_args)
  File "/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pyfftw/builders/builders.py", line 543, in irfftn
    return _Xfftn(a, s, axes, overwrite_input, planner_effort,
  File "/home/exx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pyfftw/builders/_utils.py", line 260, in _Xfftn
    FFTW_object = pyfftw.FFTW(input_array, output_array, axes, direction,
  File "pyfftw/pyfftw.pyx", line 1223, in pyfftw.pyfftw.FFTW.__cinit__
ValueError: ('Strides of the output array must be less than ', '2147483647')

Definitely looks like a memory/size error from the pyfftw message — 2147483647 is the signed 32-bit integer maximum, so the strides of one of the FFT arrays are exceeding what pyfftw can handle.

Does dmesg say anything (need superuser permissions if running RedHat-based OS)? While GPU crashes in dmesg can be cryptic, if it’s specifically the GPU complaining it will confirm your hypothesis…

How many particles (GPU batch size) is it loading at a time?

1024 should be fine with LMM. Are you hitting Nyquist at 768? (Could try 960 px, which is still a sum of powers of two…?)
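Just to illustrate why 960 is still a convenient box size (my own quick check, nothing CryoSPARC-specific):

    # 960 is a sum of consecutive powers of two, and factors into small primes,
    # which FFT libraries generally handle efficiently.
    n = 960
    print(n == 512 + 256 + 128 + 64)   # True

    factors, m = [], n
    for p in (2, 3, 5, 7):
        while m % p == 0:
            factors.append(p)
            m //= p
    print(factors, m)                  # [2, 2, 2, 2, 2, 2, 3, 5] 1  ->  960 = 2^6 * 3 * 5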


Will check re dmesg - not sure re batch size as I didn’t alter the defaults for that and can’t see where it is specified in NU-refine. Total number of particles was 50k. Will try 960px, thx!

I think the easiest place to check the GPU batch size is during refinement, e.g.:

 [CPU:  64.71 GB  Avail: 676.77 GB] -- THR 0 BATCH 500 NUM 309500 TOTAL 4033.1170 ELAPSED 16030.820 --

Where “BATCH 500” indicates it’s feeding the GPU 500 particles a go?

50K certainly shouldn’t be an issue, was half expecting you to say 1M+ :wink:

Didn’t even get that far!

Oh, whoops! But the initial model looks good? No corrupt particles? (grasping at straws a little with it crashing so early, sorry)

What is it converting from and to (box, angpix)?

Initial model was an ab initio (4.95Å/pix, 256px), converting to 1024px, 0.825Å/pix - same input model works fine with the same particle set downsampled to 768px though, so I don’t think it is a corrupt particle issue…

Could you resample it with Volume Tools, then use the result, see if that works?

The 768 pixel boxes are from downsampled 1024 pixel that crashes? Or independent extractions?


Will test resampling the volume first! Particles are downsampled versions of the set that crashed

Hi @olibclarke!

I have had this problem, exactly as you describe, since v4.4 came out… I thought it was a GPU VRAM limitation, but never understood why as I would expect the particle stack to fit well within 24GB.

Out of curiosity: which job did you use to extract the particles?

I have never managed to use NU after RBMC with box sizes above 1000px.

Cheers,
André

Hi @olibclarke,

I believe this may be occurring because CryoSPARC is attempting to resample and crop the input volume to match the pixel and box size of the particles. If I’m following correctly, the particles have a spatial extent of (1024 px * 0.825 Å/pix) = 844.8 Å, but the volume has an extent of (4.95 Å/pix * 256 px) = 1267.2 Å, which is bigger than the particles. The line where the job is failing is one where we first change the volume’s pixel size to that of the particles (0.825 Å) while keeping the physical extent constant (which necessitates increasing the box from 256 all the way to 6 * 256 px = 1536 px), and then subsequently crop in real space to 1024 px. So in other words, it briefly moves through a volume size of 1536 px, which appears to be too big for pyfftw.
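For concreteness, here is the arithmetic as I understand it (an illustrative sketch only; the variable names are mine, not CryoSPARC’s):

    # Resample-then-crop arithmetic for this job (illustrative sketch, not CryoSPARC code).
    vol_box, vol_psize = 256, 4.95       # ab initio volume: box (px), pixel size (Å/px)
    part_box, part_psize = 1024, 0.825   # particle stack:   box (px), pixel size (Å/px)

    # Step 1: resample the volume to the particle pixel size, keeping the physical
    # extent (256 * 4.95 = 1267.2 Å) constant.
    intermediate_box = round(vol_box * vol_psize / part_psize)   # 256 * 6 = 1536 px

    # Step 2: crop the resampled volume in real space to the particle box size.
    final_box = part_box                                         # 1024 px

    print(intermediate_box, final_box)   # 1536 1024 -- the 1536 px intermediate is where it fails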

There are two ways I think you could work around this –

  1. Crop the input volume to a smaller spatial extent using Volume Tools. I think this would prevent moving through the larger box size of 1536 px.
    • Based on our testing, you would have to crop the input volume to a box size smaller than 1280 px after resampling to the particle pixel size — so here, smaller than 1280/6 ≈ 213 px, i.e. 212 px in practice (keeping the box size even).
  2. Downsample the input particles (i.e. sample them at a larger pixel size), to a pixel size such that, after resampling the volume to the particle pixel size, it ends up at a box size of 1280 px or less (a rough calculation of both thresholds is sketched below).
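A quick back-of-the-envelope version of both thresholds, assuming the 1280 px limit from our testing above (again, my own sketch, not CryoSPARC code):

    # Estimating the two workaround thresholds (illustrative only).
    vol_box, vol_psize = 256, 4.95   # input volume: box (px), pixel size (Å/px)
    part_psize = 0.825               # particle pixel size (Å/px)
    max_intermediate = 1280          # largest resampled box that worked in testing

    # Workaround 1: largest crop of the volume (at 4.95 Å/px) that stays under the
    # limit once resampled to the particle pixel size.
    max_vol_crop = max_intermediate * part_psize / vol_psize     # 1280 / 6 ≈ 213.3 px

    # Workaround 2: smallest particle pixel size for which the uncropped volume
    # (1267.2 Å extent) resamples to 1280 px or less.
    min_part_psize = vol_box * vol_psize / max_intermediate      # 1267.2 / 1280 = 0.99 Å/px

    print(f"crop volume to < ~{max_vol_crop:.0f} px, or downsample particles to >= {min_part_psize:.2f} A/px")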

Apologies for all of the numbers…
Could you let us know if either of these steps allows the job to progress further?

Best,
Michael


Ah - thanks @mmclean, I never would have thought of that! Will test & report back

If this is the case, perhaps it would be worth making the log more granular and showing what box size it is resampling to for the conversion? If I had realized it was resampling to a 1536 px box as part of the process, that would definitely have given me a hint as to what was going on.