Issue on a cluster

Juha · February 6, 2020, 7:46pm

I successfully installed the latest version of cryoSPARC on our HPC cluster. However, the cluster admins noticed that the job (per-particle motion correction) was “forking 15000 open operations in a short amount of time”, which caused problems with the “lustre metadata operation queue” and slowed everything down. Any suggestions on how to set up cryoSPARC so that this doesn’t happen?

stephan · February 7, 2020, 4:53pm

Hi @Juha,

Can you provide more information:

Exact job names that were being run when these issues occurred
Number of movies being processed
Number of particles per image/ total number of particles

Juha · February 7, 2020, 6:04pm

Hi @stephan,

The only job running was “Local Motion Correction (multi)”. It was using 24 cores and 4 GPUs, and was properly submitted to a compute node with sufficient resources. The job contained 12,000 movies and 1,800,000 particles in total (so about 150 particles per image on average).

stephan · February 7, 2020, 6:49pm

Hey @Juha,

The Local Motion Correction (Multi) job:

reads a single movie
extracts every particle in the movie
writes the particle stack to disk (one particle stack per movie file)
Note: each GPU runs these steps independently.

Therefore, during the local motion job you may see about twice as many files being opened on the file system as there are movies (one read and one write per movie). Can you report when the job started running and when it completed? You’ll find this information in the job details panel.
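
As a rough sketch of the file-open estimate above: the function name and the two-opens-per-movie default are illustrative assumptions based on the steps listed, not cryoSPARC internals, and the real count may be higher if the job touches metadata or temporary files.

```python
def estimate_file_opens(num_movies: int, opens_per_movie: int = 2) -> int:
    """Rough count of file opens for a local motion correction run.

    Assumes each movie costs one open to read the movie and one open
    to write its particle stack (hypothetical simplification).
    """
    return num_movies * opens_per_movie


# 12,000 movies is the dataset size reported earlier in this thread.
total_opens = estimate_file_opens(12_000)
print(total_opens)  # 24000
```

This only gives a total over the whole run; it says nothing about how tightly those opens are clustered in time.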

Juha · February 7, 2020, 6:53pm

Thanks. Here’s the info:

STARTED
Wed Feb 05, 20 05:22:58 PM +02:00

FAILED
Fri Feb 07, 20 07:58:27 PM +02:00

(The job didn’t run to completion as I was asked to stop the cryoSPARC server.)

Edit: the FAILED time does not make sense. I think it was updated when I tried to mark the failed job as complete (which didn’t work).

stephan · February 7, 2020, 6:58pm

Hi @Juha,

Great, about how many movies did the job complete before being terminated?

Juha · February 7, 2020, 7:07pm

8327 movies were completed
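
For reference, a back-of-the-envelope estimate from these numbers, sketched in Python. It assumes roughly two file opens per movie (one read, one write), which is only an approximation, and it uses the FAILED timestamp, which is probably too late, so the elapsed time is an upper bound and the average rate a lower bound.

```python
from datetime import datetime, timedelta, timezone

# Values quoted in this thread.
MOVIES_COMPLETED = 8327
OPENS_PER_MOVIE = 2  # assumption: one movie read + one particle-stack write

tz = timezone(timedelta(hours=2))  # timestamps are reported as +02:00
started = datetime(2020, 2, 5, 17, 22, 58, tzinfo=tz)
# The FAILED timestamp is unreliable, so it only bounds the run time from above.
failed = datetime(2020, 2, 7, 19, 58, 27, tzinfo=tz)

total_opens = MOVIES_COMPLETED * OPENS_PER_MOVIE
elapsed_s = (failed - started).total_seconds()
min_avg_opens_per_hour = total_opens / elapsed_s * 3600

print(f"estimated file opens so far: {total_opens}")
print(f"elapsed time (upper bound): {elapsed_s / 3600:.1f} h")
print(f"average rate (lower bound): {min_avg_opens_per_hour:.0f} opens/h")
```

The estimated total (16,654) is of the same order as the 15000 open operations the admins mentioned, but an average over the whole run cannot show short bursts, which is what they described.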

stephan · February 13, 2020, 5:26pm

Hi @Juha,

Could you try running the job with fewer GPUs, so that the Lustre file system doesn’t get overloaded?