Local refine jobs fail after iteration 1 (memory?)

Dear all,

I’m observing odd behaviour in a series of local refinement (legacy) jobs I’m running after symmetry expansion. I’m trying different masks on parts of a protein complex, and while one mask/job runs perfectly fine and converges after iteration 12, two others fail after iteration 1 and another one fails after iteration 23.

This one works:
[image]

These don’t:
[image]

[image]

I prepare the masks according to the instructions on the blog, using segmentation in Chimera, and treat them all the same in terms of dilation/softening. The only difference is (or should be) that they cover different parts of the protein complex, although they are very close to each other (in fact, the masks overlap slightly). All other inputs are the same. Since there is no “real” error, I can only guess why they fail; I have someone from our cluster team investigating the cluster side of it. We are currently running CS v3.1.0.

Could it be that some masks/subvolumes are computationally more expensive, so the job runs out of memory? (I think this is what is happening for us at the moment with the new local refinement.) But there is also this sudden jump in what it shows as CPU memory? … Should I play around with the masking? Dynamic vs. static mask?
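For anyone wondering how to check this kind of thing: a rough sketch of how the actual memory use of a running job can be watched on a PBS cluster like ours (the exact resources_used field names depend on the PBS/TORQUE version, and JOBID / NODENAME below are placeholders, not our real values):

# CPU RAM the scheduler has accounted for the job so far
qstat -f JOBID | grep resources_used

# or check memory directly on the compute node the job landed on
ssh NODENAME free -g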

This is our submission script:

#!/bin/bash
#PBS -N cryosparc_P24_J608
#PBS -l nodes=1:ppn=4:gpus=1:shared
#PBS -l mem=96000mb
#PBS -l walltime=168:00:00
#PBS -o /home/projects/cpr_sbmm/people/clakie/cryoSPARC/P24/J608
#PBS -e /home/projects/cpr_sbmm/people/clakie/cryoSPARC/P24/J608

available_devs=""
for devidx in $(seq 0 15);
do
if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
if [[ -z “$available_devs” ]] ; then
available_devs=$devidx
else
available_devs=$available_devs,$devidx
fi
fi
done

/home/people/cryosparc/software/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P24 --job J608 --master_hostname g-05-c0351.cm.cluster --master_command_core_port 39002 > /home/projects/cpr_sbmm/people/clakie/cryoSPARC/P24/J608/job.log 2>&1

Any help and hints are much appreciated!!

Best,
Claudia

Hi @ClaudiaKielkopf,

Could you post (or DM me) the job log for the two jobs that failed? You can find this using the command cryosparcm joblog PX JY, where X is the project number and Y is the job number. At least in the last image, it’s very likely that the job ran out of CPU RAM. The middle image is strange, though, as it seems unlikely that the memory usage jumped from 37 GB to over 96 GB, and I wonder if there was some other error that caused the termination. The memory usage shouldn’t depend on the mask itself, only on the box size.
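For example, with the project and job numbers from this thread, that would be the first command below. The rest is only a rough back-of-the-envelope illustration of the box-size dependence (box=400 is a made-up example, and a refinement holds many such arrays plus FFT buffers, so real usage is a multiple of this):

cryosparcm joblog P24 J608

# rough size of ONE single-precision (4-byte) copy of a box^3 volume, in MiB
box=400
echo $(( box * box * box * 4 / 1024 / 1024 )) MiB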

Best,
Michael

Hej Michael,

thanks for your message! It looks like there was an issue on our cluster, and the job failures had nothing to do with CS.

From the job log for a job that failed in iteration 1:

Jobs Queued: [('P24', 'J608')]
Licenses currently active : 1
Now trying to schedule J608
Need slots : {'CPU': 4, 'GPU': 1, 'RAM': 3}
Need fixed : {'SSD': False}
Master direct : False
Scheduling job to computerome2
Failed to connect link: HTTP Error 502: Bad Gateway
Not a commercial instance - heartbeat set to 12 hours.
Launchable! – Launching.
Changed job P24.J608 status launched
Running project UID P24 job UID J608
Running job on worker type cluster
cmd: ssh g-05-c0351 qsub /home/projects/cpr_sbmm/people/clakie/cryoSPARC/P24/J608/queue_sub_script.sh
Changed job P24.J608 status started
Changed job P24.J608 status running
Changed job P24.J608 status failed

I’m currently re-running the jobs that failed in iteration 1, and so far it looks good. I’m also re-running the job that failed in iteration 23; we’ll see whether that one also failed because of the cluster or did indeed run out of memory.

One (minor) issue we ran into was that our IT person couldn’t recover the job logs from the jobs that failed a week ago; it seems these job logs get overwritten at some stage (bad time to go on vacation when jobs fail ;-)).
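In case anyone runs into the same thing: since our submission script redirects the worker output to job.log inside the job directory (see the last line of the script above), re-queuing the same job overwrites that file. A simple workaround might be to stash a timestamped copy before re-running (just a sketch, using the path from the script above):

cp /home/projects/cpr_sbmm/people/clakie/cryoSPARC/P24/J608/job.log \
   /home/projects/cpr_sbmm/people/clakie/cryoSPARC/P24/J608/job.log.$(date +%Y%m%d_%H%M)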

I’ll post another update when the jobs finish or fail.

Thanks,
Claudia