Issue on a cluster

#1

I successfully installed the latest version on our HPC cluster. The cluster admins, however, noticed that the job (per-particle motion correction) was “forking 15000 open operations in a short amount of time”, which caused issues with the “lustre metadata operation queue” and slowed everything down. Any suggestions on how to set up cryoSPARC so that this doesn’t happen?


#2

Hi @Juha,

Can you provide more information:

  1. Exact job names that were being run when these issues occurred
  2. Number of movies being processed
  3. Number of particles per image / total number of particles

#3

Hi @sarulthasan,

The only job that was running was “Local Motion Correction (multi)”. It was using 24 cores and 4 GPUs, and was properly submitted to a compute node with sufficient resources. The number of movies in the job was 12,000. The total number of particles was 1,800,000 (so there were 150 particles per image on average).


#4

Hey @Juha,

The Local Motion Correction (Multi) job:

  1. reads a single movie,
  2. extracts every particle in the movie,
  3. writes the particle stack to disk (1 particle stack per movie file).

  Note: this is done per GPU.

Therefore, you may see roughly twice as many files being opened on the file system as there are movies during the local motion job. Can you report when the job started running and when it completed? You’ll find this information in the job details panel.
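
As a rough back-of-the-envelope check, the estimate above can be sketched in a few lines of Python. This assumes exactly one open to read each movie and one open to write its particle stack; the function name and the factor of two are illustrative assumptions, not something taken from the cryoSPARC code, and the real count may differ.

```python
def estimate_file_opens(num_movies: int, opens_per_movie: int = 2) -> int:
    """Estimate total file opens for a job.

    Assumption (hypothetical model): each movie costs one open to read
    the movie and one open to write its particle stack.
    """
    return num_movies * opens_per_movie


# 12,000 movies is the figure reported earlier in this thread.
total_opens = estimate_file_opens(12_000)
print(total_opens)
```

Under that assumption, the full 12,000-movie job would account for about 24,000 file opens in total, spread over the lifetime of the job.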


#5

Thanks. Here’s the info:

STARTED
Wed Feb 05, 20 05:22:58 PM +02:00

FAILED
Fri Feb 07, 20 07:58:27 PM +02:00

(The job didn’t run to completion, as I was asked to stop the cryoSPARC server.)

Edit: the failed time does not make sense. I think it was updated when I tried to mark the failed job as complete (this didn’t work).


#6

Hi @Juha,

Great, about how many movies did the job complete before being terminated?


#7

8327 movies were completed


#8

Hi @Juha,

Could you try using fewer GPUs so that the Lustre file system doesn’t get overloaded?