Select 2D fails when using particles from relion_preprocess

pgoetz · February 13, 2020, 12:01am

I recently ran updates on our Ubuntu 18.04.2 cryosparc worker nodes, which updated the Nvidia drivers from version 430 to version 440. Now cryosparc doesn’t work any more (with either the Cuda 9.1 or the Cuda 9.2 libraries). Jobs which ran prior to the update now fail with this error message:

[CPU: 244.37 GB] Traceback (most recent call last):
  File "cryosparc2_master/cryosparc2_compute/run.py", line 78, in cryosparc2_compute.run.main
  File "cryosparc2_compute/jobs/select2D/run.py", line 262, in run
    _, particles_to_plot = mrc.read_mrc(os.path.join(proj_dir_abs, particles_dset_exclude.data['blob/path'][0]))
  File "cryosparc2_compute/blobio/mrc.py", line 114, in read_mrc
    data = read_mrc_data(file_obj, header, start_page, end_page, out)
  File "cryosparc2_compute/blobio/mrc.py", line 77, in read_mrc_data
    data = n.fromfile(file_obj, dtype=dtype, count= num_pages * ny * nx).reshape(num_pages, ny, nx)
MemoryError

I’m going to try reverting the Nvidia drivers, but I’m at a loss as to why an Nvidia kernel driver update would cause cryosparc to stop working.

In general, CryoSPARC seems to be very brittle with respect to Nvidia drivers/Cuda libraries. Can someone tell me which versions of the Nvidia drivers / Cuda libraries have been tested to work with CryoSPARC?

pgoetz · February 13, 2020, 3:45am

It looks like the nvidia-440 drivers might need Cuda 10.2. Trying this now. In the past, cryoSPARC needed precisely the Cuda 9.1 drivers or wouldn’t run, but maybe this has been fixed.

pgoetz · February 13, 2020, 12:56pm

CryoSPARC doesn’t seem to work with the Cuda 10.2 libraries. In general, CryoSPARC seems to be very sensitive to specific Nvidia drivers/libraries.

Can someone tell me precisely what versions of the Nvidia drivers and Cuda libraries have been tested to work with CryoSPARC?

crescalante · February 13, 2020, 2:37pm

I have cryosparc v12.3 with cuda 10.2 and the latest version of the nvidia drivers for GTX 1080Ti on two machines. They run fine with most jobs, but I am having problems with local refinement jobs that include a non-uniform refinement option.

stephan · February 13, 2020, 3:02pm

Hi @pgoetz,

It looks like your job is using almost 250GB of CPU memory, and your error message says “MemoryError” after a numpy function call, so I think your issue is related to CPU memory rather than the GPU. What job are you running? Based on the logs, it looks like a Select 2D job, which doesn’t require GPUs, meaning your recent CUDA change wouldn’t have an effect. It’s possible that your particle dataset is so large that it doesn’t fit in memory. Can you also report how many particles you’re trying to select? If it’s a large number, you can use multiple Select 2D jobs connected to the 2D Classification job to select a few classes at a time.

pgoetz · February 13, 2020, 5:02pm

Hi sarulthasan -

How are you computing that we’re using 250GB of CPU memory? Presumably most of this is allocated (VIRT) rather than resident (RSS). Both CryoSPARC worker nodes are equipped with 128G of RAM but only 64G of swap. I’ve asked one of the users to chime in (I’m the admin), but my understanding is that jobs that ran a week or two ago on the same machine are now crashing with a MemoryError.

The only thing that’s changed is we upgraded to v2.13.2 a few days ago and I ran
apt dist-upgrade
on both the worker nodes in order to get the Nvidia driver versions synchronized.
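
For anyone wanting to check the VIRT vs. RSS question on a running job, here is a minimal Python sketch. It assumes a Linux worker node, where /proc/<pid>/status reports VmSize (virtual size, what top calls VIRT) and VmRSS (resident set size). The function names are made up for illustration and are not part of cryoSPARC, and this says nothing about how cryoSPARC computes the "[CPU: ...]" figure in its log.

```python
import os


def parse_proc_status(text):
    """Extract VmSize and VmRSS (in kB) from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            # The value looks like "   204800 kB"; keep only the number.
            fields[key] = int(value.split()[0])
    return fields


def memory_of(pid="self"):
    """Read virtual and resident memory for a process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_proc_status(fh.read())


if __name__ == "__main__":
    if os.path.exists("/proc/self/status"):
        usage = memory_of()
        print("VIRT (kB):", usage.get("VmSize"))
        print("RSS  (kB):", usage.get("VmRSS"))
```

Passing the PID of the job process instead of "self" shows whether the large number is address space that was only reserved or memory that is actually resident.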

Jason · February 13, 2020, 9:23pm

We determined that the issue is with v2.13.2. When we downgraded to v2.12.4, the Select 2D job finished without problems. The issue has nothing to do with the Nvidia drivers, because we had already reverted those.

What changed in the Select 2D Classes job between v2.12.4 and v2.13.2? An important piece of data is that this occurs only when we import particles output from relion_preprocess. That produces 1 star file and 1 large mrcs file containing all the particles. We sometimes need to work this way, however.

The error occurs after we select the 2D classes and hit ‘done’. The particles are never written out and the job ends up being terminated.

Any thoughts?
Jason

stephan · February 14, 2020, 6:53pm

Hi @Jason,

Thank you for reporting. @pgoetz, did you also use a particle stack created by relion_preprocess when this error occurred?

pgoetz · February 14, 2020, 7:08pm

Hi sarulthasan -

Jason is the one creating the particle stack; I just run/debug the systems. I’m pretty sure the issue is with relion_preprocess accumulating all the images (439547 to be exact) into a single mrcs file. We don’t have this problem when working with multiple smaller mrcs files.
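
As a rough sanity check, the in-memory size of a single stack can be estimated from the particle count and box size. This is only a sketch: the box sizes below are hypothetical (ours isn’t stated in this thread), and it assumes 4 bytes per pixel, i.e. 32-bit float images.

```python
def stack_size_gib(n_particles, box_px, bytes_per_pixel=4):
    """Approximate size in GiB of a particle stack loaded fully into memory.

    bytes_per_pixel=4 assumes 32-bit float pixels (an assumption, not
    something confirmed for this dataset).
    """
    return n_particles * box_px * box_px * bytes_per_pixel / 2**30


if __name__ == "__main__":
    n_particles = 439547  # particle count quoted above
    for box_px in (128, 256, 384):  # hypothetical box sizes
        size = stack_size_gib(n_particles, box_px)
        print(f"box {box_px} px: about {size:.1f} GiB")
```

Even at a moderate box size, reading the whole file at once approaches or exceeds the 128G of RAM on these nodes, which would be consistent with a MemoryError that only appears with the single large mrcs file.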

stephan · February 19, 2020, 7:32pm

Hi @Jason, @pgoetz,

This turns out to be a problem with how the Select 2D job works. When plotting the particles you see in the “Overview” tab, we read in the entire particle stack because we assume it came from one micrograph (so not too many particles). In this case, we might have to be less lazy and open only the particle images we need, so that we don’t use up the entire system’s memory! For the time being, would it be possible to split up the particle stack into smaller .mrc files? We’ll fix this and update this post when it’s done. Thank you for reporting!
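
To illustrate the idea of opening only the images that are needed, here is a minimal sketch using numpy.memmap. It is not the cryoSPARC implementation or the eventual fix. It assumes a little-endian MRC file with the standard 1024-byte header (nx, ny, nz and mode in the first four 32-bit words, extended header size in word 24), and the function names are made up for illustration.

```python
import numpy as np

# MRC mode number -> little-endian numpy dtype (common modes only).
MRC_DTYPES = {0: "i1", 1: "<i2", 2: "<f4", 6: "<u2"}


def open_stack_lazily(path):
    """Memory-map an MRC stack without reading the image data."""
    # The fixed header is 1024 bytes, i.e. 256 32-bit words.
    header = np.fromfile(path, dtype="<i4", count=256)
    nx, ny, nz, mode = (int(v) for v in header[:4])
    ext_header_bytes = int(header[23])  # word 24: extended header size
    return np.memmap(
        path,
        dtype=MRC_DTYPES[mode],
        mode="r",
        offset=1024 + ext_header_bytes,
        shape=(nz, ny, nx),
    )


def read_particles(path, indices):
    """Load only the requested particle images into memory."""
    stack = open_stack_lazily(path)
    # np.array copies each selected page out of the memory map.
    return np.stack([np.array(stack[i]) for i in indices])
```

With this approach, something like read_particles(path, [0, 1, 2]) touches only the pages for those three images, so memory use scales with the number of particles plotted rather than with the size of the file.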

stephan · February 26, 2020, 10:58pm

Hey @Jason, @pgoetz,

EDIT: The fix for this bug has now been released in the latest version of cryoSPARC (v2.15.0). If you are on this version there is no need to download and install the patch file linked below.

Please see this post; the linked file includes a fix for the issue you’re having.

Jason · February 27, 2020, 4:50am

Thanks, we’ll give it a try and report back!
Jason

pgoetz · February 27, 2020, 5:10pm

OK, run.py has been updated; now waiting for the users to report back.