Memory error only when using the SSD

Dear All, I am facing a serious problem with the second version of cryoSPARC, and we do not know how to solve it.
We previously had cryoSPARC 1 installed on single-GPU nodes of our cluster, and everything worked fine. The new version runs through the batch system, meaning I submit a job and it runs on whichever GPU node is available at that moment. Our GPU nodes have 4 GTX 1080 cards with 11 GB of RAM each. The problem is that when I enable SSD caching, the job crashes soon after it starts, saying that it has run out of memory. The same job runs fine when the SSD is not enabled, but it clearly takes much longer. Monitoring GPU usage while the SSD is enabled, I see that it rarely exceeds half of its RAM before crashing. Parallelizing the job over more than one GPU does not solve the issue.

I kindly ask the cryoSPARC developers directly whether they could give any suggestions on how to solve this issue.

Thank you and kind regards,
Jacopo.

Hello Jacopo,

Thank you for reporting this issue.

Could you share the error message that you get when the job crashes?

Best,
Ali H.

Dear Ali,

Thank you for your help. I am pasting the entire output page below:

Launching job on lane merlin5 target merlin5 …

License is valid.

Launching job on cluster merlin5

====================== Cluster submission script: ========================

#!/usr/bin/env bash

## cryoSPARC cluster submission script template for SLURM
##
## Available variables:
## /gpfs/data/marino_j/cryosparc/v2/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P4 --job J100 --master_hostname merlin-l-02.psi.ch --master_command_core_port 39002 > /gpfs/data/marino_j/rhonofab/J100/job.log 2>&1 - the complete command string to run the job
## 2 - the number of CPUs needed
## 1 - the number of GPUs needed.
##     Note: the code will use this many GPUs starting from dev id 0
##     the cluster scheduler or this script have the responsibility
##     of setting CUDA_VISIBLE_DEVICES so that the job code ends up
##     using the correct cluster-allocated GPUs.
## 16.0 - the amount of RAM needed in GB
## /gpfs/data/marino_j/rhonofab/J100 - absolute path to the job directory
## /gpfs/data/marino_j/rhonofab - absolute path to the project dir
## /gpfs/data/marino_j/rhonofab/J100/job.log - absolute path to the log file for the job
## /gpfs/data/marino_j/cryosparc/v2/cryosparc/cryosparc2_worker/bin/cryosparcw - absolute path to the cryosparc worker command
## --project P4 --job J100 --master_hostname merlin-l-02.psi.ch --master_command_core_port 39002 - arguments to be passed to cryosparcw run
## P4 - uid of the project
## J100 - uid of the job
##
## What follows is a simple SLURM script:

#SBATCH --job-name cryosparc_P4_J100
##SBATCH -n 2
##SBATCH --gres=gpu:1
#SBATCH -p gpu
#SBATCH --mem=16000MB
#SBATCH -o /gpfs/data/marino_j/rhonofab/J100/job.out
#SBATCH -e /gpfs/data/marino_j/rhonofab/J100/job.err
#SBATCH --nodes=1
#SBATCH --exclusive
##SBATCH -w merlin-g-02

available_devs=""
for devidx in $(seq 0 15);
do
if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
if [[ -z "$available_devs" ]] ; then
available_devs=$devidx
else
available_devs=$available_devs,$devidx
fi
fi
done
export CUDA_VISIBLE_DEVICES=$available_devs

hostname
echo "CUDA_VISIBLE_DEVICES : ${CUDA_VISIBLE_DEVICES}"
echo "/gpfs/data/marino_j/cryosparc/v2/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P4 --job J100 --master_hostname merlin-l-02.psi.ch --master_command_core_port 39002 > /gpfs/data/marino_j/rhonofab/J100/job.log 2>&1 "

/gpfs/data/marino_j/cryosparc/v2/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P4 --job J100 --master_hostname merlin-l-02.psi.ch --master_command_core_port 39002 > /gpfs/data/marino_j/rhonofab/J100/job.log 2>&1

==========================================================================

-------- Submission command:
sbatch /gpfs/data/marino_j/rhonofab/J100/queue_sub_script.sh

-------- Cluster Job ID:
770971

-------- Queued at 2018-09-03 11:32:55.399259

-------- Job status at 2018-09-03 11:32:55.420849
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
770971 gpu cryospar marino_j PD 0:00 1 (None)

Project P4 Job J100 Started

Master running v2.0.27, worker running v2.0.27

Running on lane merlin5

Resources allocated:

Worker: merlin5

CPU : [0, 1]

GPU : [0]

RAM : [0, 1]

SSD : True


Importing job module for job type class_2D…

Job ready to run


Using random seed of 1019575907

Loading a ParticleStack with 161982 items…

SSD cache : cache successfuly synced in_use

SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

SSD cache : cache successfuly requested to check 449 files.

SSD cache : cache requires 40495.94MB more on the SSD for files to be downloaded.

SSD cache : cache has enough available space.

Transferring J95/localmotioncorrected/FoilHole_144661_Data_128122_128123_20180201_2104_Fractions_particles_local_aligned.mrc (32MB)
Complete : 40464MB
Total : 40496MB
Speed : 197.29MB/s

SSD cache : complete, all requested files are available on SSD.

Done.

Windowing particles

Done.

Using 50 classes.

Computing 2D class averages:

Volume Size: 128 (voxel size 2.22A)

Zeropadded Volume Size: 256

Data Size: 256 (pixel size 1.11A)

Using Resolution: 6.00A (47.0 radius)

Windowing only corners of 2D classes at each iteration.

Using random seed for initialization of 825386084

Done in 1.117s.

Start of Iteration 0

Traceback (most recent call last):
File "cryosparc2_compute/jobs/runcommon.py", line 705, in run_with_except_hook
run_old(*args, **kw)
File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 92, in cryosparc2_compute.engine.cuda_core.GPUThread.run
File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 93, in cryosparc2_compute.engine.cuda_core.GPUThread.run
File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 832, in cryosparc2_compute.engine.engine.process.work
File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 226, in cryosparc2_compute.engine.engine.EngineThread.compute_resid_pow
File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 233, in cryosparc2_compute.engine.cuda_core.EngineBaseThread.ensure_allocated
File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 101, in cryosparc2_compute.engine.cuda_core.allocate_cpu
MemoryError: cuMemHostAlloc failed: out of memory

Hi @marino-j
Thanks again for reporting this.
It appears that it is actually not the GPU memory that is running out, but the CPU memory: the error is raised while attempting to allocate memory on the host (CPU). Since you are running on a cluster, the most likely cause is that the cryoSPARC job process temporarily needs more RAM than is requested in the cluster submission script (16000MB), but only when it is also caching particles from the SSD (which, for some reason, requires more RAM as well). Your cluster enforces strict memory limits on running processes, so the process dies at that point.

In the upcoming version we have increased the default memory request for class2D to 24000MB, which should solve the problem. In the meantime, you can also simply hardcode a larger memory request in the cluster submission script.
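For example, with the SLURM template shown above, the edit would look something like the sketch below (the exact template expression for the memory request may differ in your cluster_script.sh, so treat this as an illustration rather than the exact lines in your file):

## replace the templated memory request, e.g.
##   #SBATCH --mem={{ (ram_gb*1000)|int }}MB
## with a hardcoded value such as
#SBATCH --mem=24000MB

Depending on how the lane was registered, you may also need to re-run cryosparcm cluster connect from the directory containing your cluster_info.json and cluster_script.sh so that newly submitted jobs pick up the edited template.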

Ali

Dear Ali,

Thank you for your help. We have increased the limit to 64000 MB and so far we are no longer getting the error.

Best wishes,
Jacopo

Hi Ali,

Actually, we still have a problem. While increasing the limit to 64 GB worked for some of the jobs, there is now a case that produces this error even with the memory limit set to 120 GB. The machines in our cluster have only 128 GB, so we cannot raise the limit any further. The cryoSPARC version is 2.2.0.

Similar jobs without SSD caching run fine, but significantly slower.

Cheers,
Vladimir