Flex Train CUDA Error

Hi,

  • Installation of the 3D-Flex dependencies was carried out with a modification.
  • Worker nodes subsequently passed cryosparcm test workers P1 --test gpu --test-pytorch.
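
For completeness, here is the kind of quick version report one can run in the cryosparc_worker_env Python to confirm the modified install left a consistent torch/CUDA/cuDNN pairing (a minimal sketch, nothing CryoSPARC-specific):

import torch

# Report the torch build and the GPU it sees, from inside cryosparc_worker_env.
print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0),
      "compute capability:", torch.cuda.get_device_capability(0))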

I’m now attempting to go through the 3D-Flex workflow with some test data (EMP-10342; 403,000 particles; 240-pixel box at 1.1 Å/pixel) on a worker node equipped with 256 GB RAM and GTX 1080 Ti GPUs (CUDA 11.7).

I am coming across the following Flex Train failure mode with a 120-pixel training box (Fourier-cropped from the 240-pixel particles):

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 93, in cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/run_train.py", line 192, in cryosparc_compute.jobs.flex_refine.run_train.run
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 1077, in cryosparc_compute.jobs.flex_refine.flexmod.run_flex_opt
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 1171, in cryosparc_compute.jobs.flex_refine.flexmod.make_cp_plots
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 1305, in cryosparc_compute.jobs.flex_refine.flexmod.get_flow
  File "/lmb/home/ylee/software/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "cryosparc_master/cryosparc_compute/jobs/flex_refine/flexmod.py", line 507, in cryosparc_compute.jobs.flex_refine.flexmod.NNFlex3TM.forward
  File "/lmb/home/ylee/software/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/lmb/home/ylee/software/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
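
For reference, the failing call boils down to F.linear(), i.e. a plain fp32 cublasSgemm. A minimal standalone check along these lines (the shapes below are arbitrary placeholders, not the real NNFlex3TM layer sizes) exercises the same path outside CryoSPARC on the same GPU:

import torch
import torch.nn.functional as F

# Exercise the same cuBLAS SGEMM path that F.linear() hits in
# flexmod.NNFlex3TM.forward, with placeholder shapes.
device = torch.device("cuda:0")
x = torch.randn(1024, 512, device=device)   # input batch (placeholder size)
w = torch.randn(256, 512, device=device)    # weight matrix (placeholder size)
b = torch.randn(256, device=device)         # bias (placeholder size)
y = F.linear(x, w, b)                       # dispatches to cublasSgemm for fp32
torch.cuda.synchronize()                    # force any asynchronous CUDA error to surface
print("F.linear OK on", torch.cuda.get_device_name(device), "->", tuple(y.shape))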

A more recent training run, Fourier-cropped from 240 to 100 pixels, seems to be running fine and does not stress system or GPU memory much. I’m curious what this error message may indicate, or whether it could be a consequence of our installation fix.
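
In case it helps narrow things down, a simple size sweep of the same GEMM path on this card might show whether the failure is size-dependent, given that 100-pixel training works and 120-pixel does not (sizes below are illustrative only; I don't know the actual layer dimensions inside flexmod):

import torch
import torch.nn.functional as F

# Illustrative sweep: does plain fp32 SGEMM on this GTX 1080 Ti start failing
# past some problem size? These are placeholder shapes, not the real ones.
device = torch.device("cuda:0")
for n in (100 * 100, 120 * 120, 240 * 240):
    x = torch.randn(64, n, device=device)
    w = torch.randn(128, n, device=device)
    y = F.linear(x, w)
    torch.cuda.synchronize()
    print(f"n={n}: ok, output shape {tuple(y.shape)}")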

Cheers,
Yang