cuMemAlloc during patch motion correction

Hello,

During patch motion correction of movies I get a cuMemAlloc error. Interestingly, larger movies (i.e., the exact same detector but twice as many frames) correct on this system with no problem. I have tried low-memory mode and turning F-crop all the way down to 1/4, to no avail. When running watch -n 1 nvidia-smi I see memory usage climb to 6.5 GB, hold there for a few seconds, and then drop once the error occurs. On the larger movies that correct successfully, usage reaches 9.8 GB and the job completes. Please see below for system specs and the full error message.

This node runs CentOS 7 and CUDA 11.2 with four RTX 2080 Ti cards on driver 460.39. Another node, with four RTX 2080 cards, fails in the same way.
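
As a quick sanity check of what the worker itself sees, here is a minimal PyCUDA sketch (PyCUDA is the library raising the error in the traceback below; this snippet is only an illustration and assumes it can be run from the cryosparc_worker_env conda environment):

import pycuda.driver as cuda

cuda.init()
for i in range(cuda.Device.count()):
    ctx = cuda.Device(i).make_context()   # a context is needed before querying memory
    free, total = cuda.mem_get_info()     # both values are in bytes
    print("GPU %d: %.1f GB free of %.1f GB" % (i, free / 1e9, total / 1e9))
    ctx.pop()                             # release the context before the next device

If the free figure reported here is far below the card's capacity while cryoSPARC is idle, something else is holding GPU memory.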

Movies:

$ header movie_00000.tif

 RO image file on unit   1 : movie_00000.tif     Size=     986236 K

                    This is a TIFF file (in strips of  11520 x      2).

 Number of columns, rows, sections .....   11520    8184      50
 Map mode ..............................    0   (byte)
 Start cols, rows, sects, grid x,y,z ...    0     0     0   11520   8184     50
 Pixel spacing (Angstroms)..............  0.9175     0.9175     0.9175
 Cell angles ...........................   90.000   90.000   90.000
 Fast, medium, slow axes ...............    X    Y    Z
 Origin on x,y,z .......................    0.000       0.000       0.000
 Minimum density .......................   0.0000
 Maximum density .......................   64.000
 Mean density ..........................   32.000
 tilt angles (original,current) ........   0.0   0.0   0.0   0.0   0.0   0.0
 Space group,# extra bytes,idtype,lens .        0        0        0        0

     2 Titles :
SerialEMCCD: Dose frac. image, scaled by 1.00  r/f 0
  SuperRef_movie_00000.dm4

Cryosparc:

$ cryosparcm status
----------------------------------------------------------------------------
CryoSPARC System master node installed at
/home/cryosparc/cryosparc2/cryosparc2_master
Current cryoSPARC version: v3.2.0
----------------------------------------------------------------------------

CryoSPARC process status:

app                              RUNNING   pid 18044, uptime 0:01:20
app_dev                          STOPPED   Not started
command_core                     RUNNING   pid 17888, uptime 0:01:30
command_rtp                      RUNNING   pid 17965, uptime 0:01:26
command_vis                      RUNNING   pid 17940, uptime 0:01:27
database                         RUNNING   pid 17802, uptime 0:01:32
liveapp                          RUNNING   pid 18075, uptime 0:01:18
liveapp_dev                      STOPPED   Not started
webapp                           RUNNING   pid 18011, uptime 0:01:22
webapp_dev                       STOPPED   Not started

----------------------------------------------------------------------------

global config variables:

export CRYOSPARC_LICENSE_ID="***"
export CRYOSPARC_MASTER_HOSTNAME="***"
export CRYOSPARC_DB_PATH="/home/cryosparc/cryosparc2/cryosparc2_database"
export CRYOSPARC_BASE_PORT=39000
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_INSECURE=false

Error:

[CPU: 207.5 MB]  Error occurred while processing J1/imported/movie_00001.tif
Traceback (most recent call last):
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/jobs/pipeline.py", line 59, in exec
    return self.process(item)
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 190, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 193, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 195, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 255, in cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 496, in cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 353, in cryosparc_compute.engine.cuda_core.EngineBaseThread.ensure_allocated
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/gpuarray.py", line 210, in __init__
    self.gpudata = self.allocator(self.size * self.dtype.itemsize)
pycuda._driver.MemoryError: cuMemAlloc failed: out of memory

Marking J1/imported/movie_00001.tif as incomplete and continuing...

I also get a cuMemAlloc error during Blob Picking, which I have never seen before.

[CPU: 681.7 MB]  Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/template_picker_gpu/run.py", line 61, in cryosparc_compute.jobs.template_picker_gpu.run.run
  File "cryosparc_worker/cryosparc_compute/jobs/template_picker_gpu/run.py", line 246, in cryosparc_compute.jobs.template_picker_gpu.run.do_pick
  File "cryosparc_worker/cryosparc_compute/jobs/template_picker_gpu/run.py", line 373, in cryosparc_compute.jobs.template_picker_gpu.run.do_pick
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/fft.py", line 134, in __init__
    onembed, ostride, odist, self.fft_type, self.batch)
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 749, in cufftMakePlanMany
    cufftCheckStatus(status)
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 124, in cufftCheckStatus
    raise e
cryosparc_compute.skcuda_internal.cufft.cufftAllocFailed

I successfully motion corrected using Motioncor2 and performed patch CTF estimation. Blob picking fails on the machine with 2080 Tis (the cuMemAlloc error above) but succeeds on the machine with only 2080s, which have far less memory. Then, when attempting to extract the picked particles, I get a “Micrograph shapes are inconsistent” error.

[CPU: 1.69 GB]   Traceback (most recent call last):
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1790, in run_with_except_hook
    run_old(*args, **kw)
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/jobs/pipeline.py", line 86, in stage_target
    work = processor.exec(item)
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/jobs/pipeline.py", line 43, in exec
    return self.process(item)
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/jobs/extract/run.py", line 469, in process
    update_alignments3D=update_alignments3D)
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/jobs/extract/extraction_gpu.py", line 49, in do_extract_particles_single_mic_gpu
    assert mic_shape == mic.shape, "Micrograph shapes are inconsistent!" # otherwise micrograph_blob/shape is inconsistent with the actual frame shape
AssertionError: Micrograph shapes are inconsistent!

Hi @posertinlab,

Regarding the most recent issue with the Motioncor2 workflow, Patch v3.2.0+210601, released yesterday, should resolve the “Micrograph shapes are inconsistent” error; see: Patch 210601 is available for cryoSPARC v3.2.0

Best,
Michael

Ah, excellent, I’ll apply that patch. Didn’t notice since I rarely use Motioncor2 these days!

After re-running the workflow from motion correction onward, blob picking now fails on both machines.

Is there any chance there are other running processes on the GPU that might be consuming memory? Could you check the status of the GPUs using nvidia-smi?
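
If it helps, a machine-readable variant of that check (a hypothetical helper script, not a cryoSPARC tool) can be run through nvidia-smi's query mode; the field names below follow nvidia-smi --help-query-compute-apps, and only compute processes are listed, so graphics clients such as Xorg or gnome-shell still require a plain nvidia-smi:

import subprocess

# List every compute process currently holding GPU memory.
# Graphics processes (e.g. /usr/bin/X, gnome-shell) are not included here;
# plain `nvidia-smi` reports those separately.
result = subprocess.run(
    ["nvidia-smi",
     "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip() or "No compute processes are holding GPU memory.")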

Best,
Michael

2080 machine:

$ nvidia-smi
Wed Jun  2 12:02:49 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    Off  | 00000000:05:00.0 Off |                  N/A |
| 22%   39C    P8     2W / 215W |     10MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 2080    Off  | 00000000:06:00.0 Off |                  N/A |
| 21%   38C    P8     1W / 215W |     10MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 2080    Off  | 00000000:09:00.0 Off |                  N/A |
| 21%   37C    P8    10W / 215W |     10MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 2080    Off  | 00000000:0A:00.0 Off |                  N/A |
| 20%   34C    P8    19W / 215W |     20MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3322      G   /usr/bin/X                          5MiB |
|    1   N/A  N/A      3322      G   /usr/bin/X                          5MiB |
|    2   N/A  N/A      3322      G   /usr/bin/X                          5MiB |
|    3   N/A  N/A      3322      G   /usr/bin/X                          9MiB |
|    3   N/A  N/A      3411      G   /usr/bin/gnome-shell                6MiB |
+-----------------------------------------------------------------------------+

2080 Ti machine:

$ nvidia-smi
Wed Jun  2 11:57:21 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:19:00.0 Off |                  N/A |
| 27%   31C    P8    12W / 250W |      3MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:1A:00.0 Off |                  N/A |
| 28%   34C    P8     4W / 250W |      3MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:67:00.0 Off |                  N/A |
| 28%   35C    P8    20W / 250W |      3MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:68:00.0 Off |                  N/A |
| 29%   40C    P8    23W / 250W |      3MiB / 11016MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I just rebooted the master node, and the blob picking error has resolved itself. I re-tested the motion correction, and that error remains.

Because cryoSPARC seems to run better on Ubuntu systems, I have since installed Ubuntu 20.04.2 on the 2080 Ti system and upgraded to CUDA 11.3. The error persists.

Hi @posertinlab,

The amount of memory required for patch motion correction depends on magnification as well as on movie size. It's possible for smaller movies to require more memory than larger ones if their pixel size is larger. What's the pixel size, in Angstroms, of the failing movies?

Harris
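
To put rough numbers on that (a back-of-envelope illustration only, using the frame dimensions from the header output above, and not a description of cryoSPARC's actual allocation scheme):

# Rough scale of the raw data involved, not cryoSPARC's real memory footprint.
cols, rows, frames = 11520, 8184, 50     # from the `header` output above
bytes_per_px = 4                         # frames are typically processed as float32
frame_gib = cols * rows * bytes_per_px / 1024**3
print("one frame        : %.2f GiB" % frame_gib)                 # ~0.35 GiB
print("all 50 frames    : %.1f GiB" % (frames * frame_gib))      # ~17.6 GiB
print("after F-crop 1/2 : %.1f GiB" % (frames * frame_gib / 4))  # each dimension halved

Holding every frame at full super-resolution in float32 would already exceed an 11 GB card, which is presumably why low-memory mode and F-crop exist; the puzzling part remains that the larger movies succeed while these fail.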

Both movies have the same pixel size: 1.8 Å/pix.

Are there any ideas for how I might address this issue?

Actually, I may still not be understanding correctly: would the number of patches also explain a failure during full-frame motion correction? In my hands, these movies fail in full-frame motion correction too, with the same cuMemAlloc out-of-memory errors.