PatchMotion failure 2.13.2


#1

Hello All,

I am experiencing this error message when trying to do patch motion with the tutorial dataset.

I have updated my CUDA environment to 9.1 and also tried 10.0.
I have tried to kill any ghost jobs with ps -ax | grep "supervisord".
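For reference, the cleanup step I tried looks roughly like this (a sketch; the match pattern is an assumption and may catch supervisord instances from other software, so I check the PIDs before killing anything):

```shell
# List candidate stale supervisord processes; do not kill blindly.
# The pattern is assumed and may match unrelated daemons.
pids=$(pgrep -f "supervisord" || true)
if [ -n "$pids" ]; then
  echo "candidate supervisord PIDs: $pids"
  # kill $pids   # only after confirming they belong to cryoSPARC
else
  echo "no supervisord processes found"
fi
```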

Any help is appreciated.

[CPU: 86.2 MB]   --------------------------------------------------------------

[CPU: 86.2 MB]   Importing job module for job type patch_motion_correction_multi...

[CPU: 165.0 MB]  Job ready to run

[CPU: 165.0 MB]  ***************************************************************

[CPU: 165.3 MB]  Job will process this many movies:  20

[CPU: 165.3 MB]  parent process is 2778287

[CPU: 133.3 MB]  Calling CUDA init from 2778321

[CPU: 133.3 MB]  Calling CUDA init from 2778324

[CPU: 133.3 MB]  Calling CUDA init from 2778323

[CPU: 133.3 MB]  Calling CUDA init from 2778322

[CPU: 165.6 MB]  Outputting partial results now...

[CPU: 165.6 MB]  Traceback (most recent call last):
  File "cryosparc2_master/cryosparc2_compute/run.py", line 78, in cryosparc2_compute.run.main
  File "cryosparc2_master/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 349, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi
AssertionError: Child process with PID 2778321 has terminated unexpectedly!

#2

Hi @ElyseF,

Can you try the following:

  1. run the same job with only one GPU and see what error messages appear
  2. run a “Rigid Motion Correction” job (not multi-GPU) and see if any errors show up

Unfortunately, the true error message from the failing subprocess does not show up in the multi-GPU patch motion job.


#3

I am seeing similar failure.

[CPU: 906.4 MB]  Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 1547, in run_with_except_hook
    run_old(*args, **kw)
  File "/home/cryosparc_user/V2.X/cryosparc2_worker/deps/anaconda/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "cryosparc2_compute/jobs/pipeline.py", line 153, in thread_work
    work = processor.process(item)
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 157, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 160, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 161, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 77, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 393, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc2_worker/cryosparc2_compute/engine/newgfourier.py", line 22, in cryosparc2_compute.engine.newgfourier.get_plan_R2C_2D
  File "/home/cryosparc_user/V2.X/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/fft.py", line 127, in __init__
    onembed, ostride, odist, self.fft_type, self.batch)
  File "/home/cryosparc_user/V2.X/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cufft.py", line 742, in cufftMakePlanMany
    cufftCheckStatus(status)
  File "/home/cryosparc_user/V2.X/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/skcuda/cufft.py", line 117, in cufftCheckStatus
    raise e
cufftAllocFailed


[CPU: 196.2 MB]  Outputting partial results now...

[CPU: 181.1 MB]  Traceback (most recent call last):
  File "cryosparc2_worker/cryosparc2_compute/run.py", line 78, in cryosparc2_compute.run.main
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 349, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi
AssertionError: Child process with PID 14219 has terminated unexpectedly!

#4

Also, MotionCor2 works fine. It seems to be a memory issue, as these are full-pixel K3 images.


#5

We’re also experiencing GPU memory problems with Patch Motion Correction in v2.13.2 on K3 super-resolution movies. However, we notice different behaviour between a workstation with 4x 2080 Ti (fails immediately unless the movies are cropped to 1/4) and one with 3x 1080 Ti (runs without any cropping, although it fails if run on the same GPU as the X server). The data is super-resolution with a pixel size of 0.826 Å (0.413 Å super-res). The workstations (CentOS 7) have different hardware, but CUDA, drivers, and kernel are all the same. The only difference in cryoSPARC is that the 1080 Ti workstation has cryoSPARC Live and the 2080 Ti one does not. It fails on the first job with the error message below. We can provide more information if needed. Thanks.

  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 157, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 160, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/run_patch.py", line 161, in cryosparc2_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 77, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/patchmotion.py", line 446, in cryosparc2_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 197, in cryosparc2_compute.engine.cuda_core.transfer_ndarray_to_cudaarray
MemoryError: cuArrayCreate failed: out of memory

#6

Hi @MHB, @liamworrall,

We have some K3 test data for this issue and are trying to optimize the GPU memory layout, but it would be very helpful to have more data that is known to fail on 1080 Ti/2080 Ti.
Would either of you be able to share ~a dozen troublesome movies?
I’ll email with upload instructions.


#7

We would be happy to share some movies.


#8

Yes, happy to provide some.


#9

Did you get our uploads?


#10

Hi @MHB,

We received 12 files, thank you! Are you able to upload a file with the microscope parameters (pixel size, total dose rate, accelerating voltage, spherical aberration in mm)? Also, is a gain reference file necessary?


#11

Will do, sorry. I should have included those.


#12

Gain reference and collection parameters uploaded.


#13

Hi, I have the same error with K3 super resolution movies…

Thanks for your help !


#14

Still occurs in 2.14.2


#15

We ran some tests with @MHB’s data on an 11GB 1080 Ti and measured a max memory requirement of around 10GB. We get the same allocation error if another process uses more than ~1.5 GB while patch motion runs.

Does that align with what everyone else is seeing? For anyone still experiencing this issue, can you post the output of nvidia-smi just before the Patch Motion job runs?
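For context, here is a back-of-envelope estimate of why K3 super-resolution frames are so demanding; the frame geometry is an assumption (K3 physical sensor 5760 x 4092, doubled in super-resolution):

```shell
# Rough size of one K3 super-resolution frame held as float32 on the GPU.
# Geometry assumed: 11520 x 8184 (physical 5760 x 4092, doubled).
width=11520
height=8184
bpp=4   # bytes per float32 sample
mib=$(( width * height * bpp / 1024 / 1024 ))
echo "one float32 super-res frame: ${mib} MiB"   # prints 359 MiB
```

With several frames resident at once plus cuFFT workspace for the R2C plans, total usage plausibly approaches the ~10 GB we measured, leaving little headroom on an 11 GB card that is also driving an X server.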


#16

Here is the info:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:05:00.0  On |                  N/A |
| 57%   63C    P8    13W / 250W |    313MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:06:00.0 Off |                  N/A |
| 57%   62C    P8    13W / 250W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 61%   68C    P8    14W / 250W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:0A:00.0 Off |                  N/A |
| 52%   57C    P8    13W / 250W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     14399      G   /usr/bin/X                                   171MiB |
|    0     16512      G   /usr/bin/gnome-shell                         138MiB |
+-----------------------------------------------------------------------------+

#17

Hi @MHB, thanks for sending that over, that’ll really help us with debugging. I have a couple more requests:

  • Question: Are you running the job on a single workstation or a worker node?
  • Can you try running the job on a specific GPU and try one of GPU 1, 2 or 3 (not 0)? Do you still get the same error? Here’s how to do that

#18

I am running on a worker node. I get the same error if I run on GPU 0 or GPU 1. Also, I cannot choose the GPU; I get this error with no option to select one:

“You need to select one more GPU”


#19

Looks like there’s something different in your 1080Ti GPU configuration compared to ours. Can you send us a full listing of your GPU information by running this shell command on the worker?

bash -c 'eval $(cryosparcw env) && python -c "import pycuda.driver as pycu; pycu.init(); print [(pycu.Device(i).name(), pycu.Device(i).compute_capability(), pycu.Device(i).total_memory(), pycu.Device(i).get_attributes()) for i in range(pycu.Device.count())]"'

Copy the full output and paste it here (it will not contain any personally identifiable information).

My apologies for all the back-and-forth, hopefully all this information will help us resolve the issue for you and all other cryoSPARC users experiencing this.

Nick


#20

pycuda.driver is not in the cryosparc_user path, so I get a “no module” error.