cryoSPARC v2.13.2 doesn't appear to work with the Nvidia 440 Linux drivers?


#1

I recently ran updates on our Ubuntu 18.04.2 cryoSPARC worker nodes, which updated the Nvidia drivers from version 430 to version 440. Now cryoSPARC no longer works (with either the CUDA 9.1 or the CUDA 9.2 libraries). Jobs which ran prior to the update now fail with this error message:

[CPU: 244.37 GB] Traceback (most recent call last):
  File "cryosparc2_master/cryosparc2_compute/run.py", line 78, in cryosparc2_compute.run.main
  File "cryosparc2_compute/jobs/select2D/run.py", line 262, in run
    _, particles_to_plot = mrc.read_mrc(os.path.join(proj_dir_abs, particles_dset_exclude.data['blob/path'][0]))
  File "cryosparc2_compute/blobio/mrc.py", line 114, in read_mrc
    data = read_mrc_data(file_obj, header, start_page, end_page, out)
  File "cryosparc2_compute/blobio/mrc.py", line 77, in read_mrc_data
    data = n.fromfile(file_obj, dtype=dtype, count= num_pages * ny * nx).reshape(num_pages, ny, nx)
MemoryError

I’m going to try to revert the Nvidia drivers, but am at a loss as to why an Nvidia kernel driver update would cause cryoSPARC to stop working.

In general, cryoSPARC seems to be very brittle with respect to Nvidia drivers/CUDA libraries. Can someone tell me which versions of the Nvidia drivers and CUDA libraries have been tested to work with cryoSPARC?


#2

It looks like the nvidia-440 drivers might need CUDA 10.2. Trying this now. In the past, cryoSPARC needed precisely the CUDA 9.1 libraries or it wouldn’t run, but maybe this has been fixed.


#3

cryoSPARC doesn’t seem to work with the CUDA 10.2 libraries. In general, cryoSPARC seems to be very sensitive to specific Nvidia drivers/libraries.

Can someone tell me precisely what versions of the Nvidia drivers and CUDA libraries have been tested to work with cryoSPARC?


#4

I have cryoSPARC v12.3 with CUDA 10.2 and the latest version of the Nvidia drivers for the GTX 1080 Ti on two machines. They run fine with most jobs, but I am having problems with local refinement jobs that include the non-uniform refinement option.


#5

Hi @pgoetz,

It looks like your job is using almost 250 GB of CPU memory, and the traceback ends with a MemoryError raised by a numpy call, so I think your issue may be related to CPU memory instead. What job are you running? Based on the logs, it looks like a Select 2D job, which doesn’t require GPUs, meaning your recent CUDA change wouldn’t have an effect. It may be that your particle dataset is so large it doesn’t fit in memory. Can you also report how many particles you’re trying to select? If it’s a large number, you can connect multiple Select 2D jobs to the 2D Classification job and select a few classes at a time.
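
For a rough sense of scale, here is a back-of-the-envelope sketch of what that single fromfile/reshape call tries to allocate. The 440-pixel box size and float32 dtype are assumptions for illustration only; the particle count is the one reported later in this thread.

import numpy as np

# Estimate the RAM needed to hold an entire particle stack in memory,
# which is what the failing n.fromfile(...).reshape(...) call attempts.
n_particles = 439_547                       # count reported later in this thread
box = 440                                   # hypothetical box size in pixels
itemsize = np.dtype(np.float32).itemsize    # MRC mode 2 stores 4-byte floats

total_bytes = n_particles * box * box * itemsize
print(f"Full stack in RAM: {total_bytes / 1024**3:.0f} GiB")   # ~317 GiB

With numbers like these the stack alone is far larger than 128 GB of RAM, so a MemoryError from numpy would be expected regardless of which GPU driver is installed.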


#6

Hi sarulthasan -

How are you computing that we’re using 250 GB of CPU memory? Presumably most of that is allocated (VIRT) rather than resident (RSS). Both CryoSPARC worker nodes are equipped with 128 GB of RAM but only 64 GB of swap. I’ve asked one of the users to chime in (I’m the admin), but my understanding is that jobs which ran last week or a couple of weeks ago on the same machine are now crashing with a MemoryError.
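
As an aside, here is a rough sketch of how one could compare resident (RSS) vs. virtual (VMS) memory for the worker processes to see where that figure comes from. It assumes the third-party psutil package, and the process-name filter is only a guess at how the cryoSPARC worker processes are named.

import psutil

# Report RSS vs. VMS for processes that look cryoSPARC/python-related,
# to check whether the large figure is resident in RAM or merely
# mapped virtual address space.
for proc in psutil.process_iter(['pid', 'name', 'memory_info']):
    name = (proc.info['name'] or '').lower()
    if 'cryosparc' in name or 'python' in name:
        mem = proc.info['memory_info']
        print(f"{proc.info['pid']:>7}  {name:<24} "
              f"RSS={mem.rss / 1024**3:6.1f} GiB  "
              f"VMS={mem.vms / 1024**3:6.1f} GiB")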

The only things that have changed are that we upgraded to v2.13.2 a few days ago and that I ran apt dist-upgrade on both worker nodes in order to get the Nvidia driver versions synchronized.


#7

We determined that the issue is with v2.13.2. When we downgraded to v2.12.4 the Select 2D job finished without problems. The issue has nothing to do with the Nvidia drivers, because we had already reverted those.

What changed in the Select 2D Classes job between v2.12.4 and v2.13.2? An important piece of data is that this occurs only when we import particles output by relion_preprocess, which produces one star file and one large mrcs file containing all the particles. We sometimes need to work this way, however.

The error occurs after we select the 2D classes and hit ‘done’. The particles are never written out and the job ends up being terminated.

Any thoughts?
Jason


#8

Hi @Jason,

Thank you for reporting this. @pgoetz, did you also use a particle stack created by relion_preprocess when this error occurred?


#9

Hi sarulthasan -

Jason is the one creating the particle stack; I just run/debug the systems. I’m pretty sure the issue is with relion_preprocess accumulating all the images (439547 to be exact) into a single mrcs file. We don’t have this problem when working with multiple smaller mrcs files.