I recently ran updates on our Ubuntu 18.04.2 cryosparc worker nodes, which updated the Nvidia drivers from version 430 to version 440. Now cryosparc doesn’t work any more (with either the Cuda 9.1 or the Cuda 9.2 libraries). Jobs which ran prior to the update now fail with this error message:
[CPU: 244.37 GB] Traceback (most recent call last):
File "cryosparc2_master/cryosparc2_compute/run.py", line 78, in cryosparc2_compute.run.main
File "cryosparc2_compute/jobs/select2D/run.py", line 262, in run
_, particles_to_plot = mrc.read_mrc(os.path.join(proj_dir_abs, particles_dset_exclude.data['blob/path'][0]))
File "cryosparc2_compute/blobio/mrc.py", line 114, in read_mrc
data = read_mrc_data(file_obj, header, start_page, end_page, out)
File "cryosparc2_compute/blobio/mrc.py", line 77, in read_mrc_data
data = n.fromfile(file_obj, dtype=dtype, count= num_pages * ny * nx).reshape(num_pages, ny, nx)
MemoryError
I’m going to try to revert the Nvidia drivers, but I am at a loss as to why an Nvidia kernel driver update would cause cryosparc to stop working.
In general, CryoSPARC seems to be very brittle with respect to Nvidia drivers/Cuda libraries. Can someone tell me which versions of the Nvidia drivers / Cuda libraries have been tested to work with CryoSPARC?
It looks like the nvidia-440 drivers might need Cuda 10.2. Trying this now. In the past, cryoSPARC needed precisely the Cuda 9.1 drivers or wouldn’t run, but maybe this has been fixed.
I have cryosparc v12.3 with cuda 10.2 and the latest version of the nvidia drivers for GTX 1080Ti on two machines. They run fine with most job types, but I am having problems with local refinement jobs that include the non-uniform refinement option.
It looks like your job is using almost 250GB of CPU memory, and your error message says “MemoryError” after a numpy function call, so I think your issue is related to CPU memory rather than the GPU. What job are you running? Based on the logs, it looks like a Select 2D job, which doesn’t require GPUs, meaning your recent driver/CUDA change wouldn’t have an effect. It may be that your particle dataset is so large that it doesn’t fit in memory. Can you also report how many particles you’re trying to select? If it is a large number, you can connect multiple Select 2D jobs to the 2D Classification job and select a few classes at a time.
How are you computing that we’re using 250GB of CPU memory? Presumably most of this is allocated (VIRT) and not resident (RSS). Both CryoSPARC worker nodes are equipped with 128G of RAM but only 64G of swap. I’ve asked one of the users to chime in (I’m the admin), but my understanding is that jobs that ran a week or two ago on the same machine are now crashing with a MemoryError.
The only thing that’s changed is we upgraded to v2.13.2 a few days ago and I ran
apt dist-upgrade
on both the worker nodes in order to get the Nvidia driver versions synchronized.
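For anyone wanting to check the VIRT-versus-RSS question directly on a Linux worker, here is a minimal sketch that reads both figures from /proc/<pid>/status. The function names are made up for illustration, and how cryoSPARC computes the number it prints in the job log is not stated in this thread, so treat this only as an independent way to look at a running process.

```python
def parse_vm_fields(status_text):
    """Return (VmSize_kB, VmRSS_kB) parsed from the text of /proc/<pid>/status.

    VmSize is the virtual size (VIRT in top); VmRSS is the resident set (RES).
    A field that is absent from the text is returned as None.
    """
    fields = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            # Lines look like "VmRSS:      123456 kB"
            fields[key] = int(rest.split()[0])
    return fields.get("VmSize"), fields.get("VmRSS")


def read_process_memory(pid):
    """Read (VmSize_kB, VmRSS_kB) for a running process (Linux only)."""
    with open("/proc/{}/status".format(pid)) as fh:
        return parse_vm_fields(fh.read())
```

Calling read_process_memory with the PID of the running job process (found with top or ps) returns both values in kB. A large gap between the two would mean most of the reported memory is allocated but not actually resident.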
We determined that the issue is with v2.13.2. When we downgraded to v2.12.4 the Select 2D job finished without problems. The issue has nothing to do with the Nvidia drivers because we already reverted those back.
What changed with the Select 2D Classes job between v2.12.4 and v2.13.2? An important piece of data is that this occurs only when we import particles output from relion_preprocess, which produces 1 star file and 1 large mrcs file containing all the particles. We sometimes need to do this, however.
The error occurs after we select the 2D classes and hit ‘done’. The particles are never written out and the job ends up being terminated.
Jason is the one creating the particle stack; I just run/debug the systems. I’m pretty sure the issue is with relion_preprocess accumulating all the images (439547 to be exact) into a single mrcs file. We don’t have this problem when working with multiple smaller mrcs files.
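A rough back-of-the-envelope sketch of why a single stack of this size can exhaust memory if it is read in one go. The particle count is the one quoted above; the box sizes are hypothetical (the thread never states the box size), and 4 bytes per pixel assumes float32 data. The 128 figure is the RAM quoted earlier in the thread, interpreted here as GiB.

```python
def stack_bytes(n_particles, box_px, bytes_per_pixel=4):
    """Bytes needed to hold the whole stack in memory (square images assumed)."""
    return n_particles * box_px * box_px * bytes_per_pixel


def fits_in_ram(n_bytes, ram_gib=128):
    """True if n_bytes fits in ram_gib GiB of RAM (ignores everything else in memory)."""
    return n_bytes <= ram_gib * 1024 ** 3


N_PARTICLES = 439547  # particle count from this thread

for box_px in (128, 256, 384):  # hypothetical box sizes, not from the thread
    n_bytes = stack_bytes(N_PARTICLES, box_px)
    print(box_px, round(n_bytes / 1024 ** 3, 1), "GiB, fits:", fits_in_ram(n_bytes))
```

The point is only that memory grows with the square of the box size, so one file holding every particle can easily exceed physical RAM, while per-micrograph stacks stay small.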
This turns out to be a problem with how the Select 2D job works. When plotting the particles you see in the “Overview” tab, we read in the entire particle stack because we assume it came from one micrograph (so not too many particles). In this case, we will have to be less lazy and open only the particle images we need, so that we don’t use up the entire system’s memory! For the time being, would it be possible for you to split up the particle stack into smaller .mrc files? We’ll fix this and update this post when it’s done. Thank you for reporting!
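To illustrate the idea of opening only the images that are needed, here is a minimal sketch that seeks to a single image in an MRC stack and reads just that page with numpy. It is not the actual cryoSPARC fix. The function name read_mrc_page is made up, and the sketch assumes little-endian, mode 2 (float32) data with the standard 1024-byte header, plus an optional extended header whose size is stored at byte offset 92.

```python
import struct

import numpy as np

HEADER_BYTES = 1024  # fixed size of the MRC main header


def read_mrc_page(path, page):
    """Read one 2D image from an MRC/MRCS stack without loading the rest.

    Assumes little-endian, mode 2 (float32) data; other modes are rejected.
    """
    with open(path, "rb") as fh:
        header = fh.read(HEADER_BYTES)
        nx, ny, nz, mode = struct.unpack_from("<4i", header, 0)
        (nsymbt,) = struct.unpack_from("<i", header, 92)  # extended header bytes
        if mode != 2:
            raise ValueError("this sketch only handles mode 2 (float32)")
        if not 0 <= page < nz:
            raise IndexError("page {} out of range (nz={})".format(page, nz))
        page_bytes = nx * ny * 4
        # Jump straight to the requested image instead of reading the whole stack
        fh.seek(HEADER_BYTES + nsymbt + page * page_bytes)
        data = np.fromfile(fh, dtype="<f4", count=nx * ny)
    return data.reshape(ny, nx)
```

Memory use is then one image per call regardless of how many particles the file holds, which is the behaviour you want when only a handful of particles are plotted.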
EDIT: The fix for this bug has now been released in the latest version of cryoSPARC (v2.15.0). If you are on this version there is no need to download and install the patch file linked below.
Please see this post; the linked file includes a fix for the issue you’re having.