Is this a bug in v2.9, or is something wrong with my script?

#1

Apologies for pasting such a long log. The job log prints the same lines many times:

-------- Submission command: 
sbatch /data/project/bio/schertler/Jacopo/modified/P1/J84/queue_sub_script.sh

-------- Cluster Job ID: 
134329506

-------- Queued at 2019-07-09 22:15:27.837448

-------- Job status at 2019-07-09 22:15:27.893762
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         134329506       gpu cryospar marino_j PD       0:00      1 (None)

Project P1 Job J84 Started

Project P1 Job J84 Started

Project P1 Job J84 Started

Project P1 Job J84 Started

Project P1 Job J84 Started

Master running v2.9.0, worker running v2.9.0

Project P1 Job J84 Started

Project P1 Job J84 Started

Master running v2.9.0, worker running v2.9.0

Project P1 Job J84 Started

Project P1 Job J84 Started

Project P1 Job J84 Started

Project P1 Job J84 Started

Project P1 Job J84 Started

Project P1 Job J84 Started

Master running v2.9.0, worker running v2.9.0

Master running v2.9.0, worker running v2.9.0

Project P1 Job J84 Started

Project P1 Job J84 Started

Master running v2.9.0, worker running v2.9.0

Project P1 Job J84 Started

Project P1 Job J84 Started

Master running v2.9.0, worker running v2.9.0

Project P1 Job J84 Started

Master running v2.9.0, worker running v2.9.0

Master running v2.9.0, worker running v2.9.0

Project P1 Job J84 Started

Master running v2.9.0, worker running v2.9.0

Master running v2.9.0, worker running v2.9.0

Project P1 Job J84 Started

Master running v2.9.0, worker running v2.9.0

Master running v2.9.0, worker running v2.9.0

Master running v2.9.0, worker running v2.9.0

Master running v2.9.0, worker running v2.9.0

Master running v2.9.0, worker running v2.9.0

Master running v2.9.0, worker running v2.9.0

Master running v2.9.0, worker running v2.9.0

Master running v2.9.0, worker running v2.9.0

Master running v2.9.0, worker running v2.9.0

Master running v2.9.0, worker running v2.9.0

Running on lane merlin6-big

Resources allocated: 

  Worker:  merlin6-big

  CPU   :  [0, 1]

  GPU   :  [0]

  RAM   :  [0]

  SSD   :  True

Running on lane merlin6-big

--------------------------------------------------------------

Resources allocated: 

Importing job module for job type homo_abinit...

  Worker:  merlin6-big

  CPU   :  [0, 1]

  GPU   :  [0]

  RAM   :  [0]

  SSD   :  True

--------------------------------------------------------------

Importing job module for job type homo_abinit...

Running on lane merlin6-big

Resources allocated: 

  Worker:  merlin6-big

  CPU   :  [0, 1]

  GPU   :  [0]

Running on lane merlin6-big

  RAM   :  [0]

Resources allocated: 

  SSD   :  True

  Worker:  merlin6-big

--------------------------------------------------------------

  CPU   :  [0, 1]

Importing job module for job type homo_abinit...

  GPU   :  [0]

  RAM   :  [0]

  SSD   :  True

--------------------------------------------------------------

Running on lane merlin6-big

Importing job module for job type homo_abinit...

Resources allocated: 

  Worker:  merlin6-big

  CPU   :  [0, 1]

  GPU   :  [0]

  RAM   :  [0]

  SSD   :  True

--------------------------------------------------------------

Importing job module for job type homo_abinit...

Running on lane merlin6-big

Resources allocated: 

  Worker:  merlin6-big

  CPU   :  [0, 1]

  GPU   :  [0]

  RAM   :  [0]

Running on lane merlin6-big

Running on lane merlin6-big

  SSD   :  True

Running on lane merlin6-big

Resources allocated: 

Resources allocated: 

--------------------------------------------------------------

Resources allocated: 

  Worker:  merlin6-big

  Worker:  merlin6-big

Importing job module for job type homo_abinit...

  Worker:  merlin6-big

  CPU   :  [0, 1]

  CPU   :  [0, 1]

  CPU   :  [0, 1]

  GPU   :  [0]

  GPU   :  [0]

  GPU   :  [0]

  RAM   :  [0]

  RAM   :  [0]

  RAM   :  [0]

  SSD   :  True

  SSD   :  True

Running on lane merlin6-big

  SSD   :  True

--------------------------------------------------------------

--------------------------------------------------------------

--------------------------------------------------------------

Resources allocated: 

Running on lane merlin6-big

Importing job module for job type homo_abinit...

Importing job module for job type homo_abinit...

  Worker:  merlin6-big

Importing job module for job type homo_abinit...

Resources allocated: 

Running on lane merlin6-big

  CPU   :  [0, 1]

  Worker:  merlin6-big

Resources allocated: 

Running on lane merlin6-big

Running on lane merlin6-big

Running on lane merlin6-big

  CPU   :  [0, 1]

  GPU   :  [0]

  Worker:  merlin6-big

Resources allocated: 

Running on lane merlin6-big

Resources allocated: 

Resources allocated: 

  CPU   :  [0, 1]

  RAM   :  [0]

  GPU   :  [0]

  Worker:  merlin6-big

  Worker:  merlin6-big

  Worker:  merlin6-big

Resources allocated: 

  GPU   :  [0]

  SSD   :  True

  RAM   :  [0]

  CPU   :  [0, 1]

  CPU   :  [0, 1]

Running on lane merlin6-big

Running on lane merlin6-big

Running on lane merlin6-big

Running on lane merlin6-big

  RAM   :  [0]

Resources allocated: 

  GPU   :  [0]

  CPU   :  [0, 1]

  GPU   :  [0]

Resources allocated: 

--------------------------------------------------------------

  Worker:  merlin6-big

  SSD   :  True

Resources allocated: 

Resources allocated: 

  SSD   :  True

  Worker:  merlin6-big

  GPU   :  [0]

  RAM   :  [0]

  RAM   :  [0]

  Worker:  merlin6-big

Importing job module for job type homo_abinit...

  CPU   :  [0, 1]

--------------------------------------------------------------

  Worker:  merlin6-big

  Worker:  merlin6-big

--------------------------------------------------------------

  CPU   :  [0, 1]

  RAM   :  [0]

  SSD   :  True

  SSD   :  True

  CPU   :  [0, 1]

  GPU   :  [0]

Importing job module for job type homo_abinit...

  CPU   :  [0, 1]

  CPU   :  [0, 1]

Importing job module for job type homo_abinit...

  GPU   :  [0]

  SSD   :  True

--------------------------------------------------------------

--------------------------------------------------------------

  GPU   :  [0]

  RAM   :  [0]

  GPU   :  [0]

  GPU   :  [0]

  RAM   :  [0]

--------------------------------------------------------------

Importing job module for job type homo_abinit...

Importing job module for job type homo_abinit...

  RAM   :  [0]

  SSD   :  True

  RAM   :  [0]

  RAM   :  [0]

  SSD   :  True

Importing job module for job type homo_abinit...

  SSD   :  True

--------------------------------------------------------------

  SSD   :  True

  SSD   :  True

--------------------------------------------------------------

--------------------------------------------------------------

Importing job module for job type homo_abinit...

--------------------------------------------------------------

--------------------------------------------------------------

Importing job module for job type homo_abinit...

Importing job module for job type homo_abinit...

Importing job module for job type homo_abinit...

Importing job module for job type homo_abinit...

Job ready to run

Job ready to run

Job ready to run

Job ready to run

Job ready to run

Job ready to run

Job ready to run

Job ready to run

Job ready to run

Job ready to run

Job ready to run

Job ready to run

Job ready to run

Job ready to run

Job ready to run

Job ready to run

Job ready to run

Job ready to run

Job ready to run

Job ready to run

***************************************************************

***************************************************************

***************************************************************

***************************************************************

***************************************************************

***************************************************************

***************************************************************

***************************************************************

***************************************************************

***************************************************************

***************************************************************

***************************************************************

***************************************************************

***************************************************************

***************************************************************

***************************************************************

***************************************************************

***************************************************************

***************************************************************

***************************************************************

Using random seed for sgd of 1780068420

Using random seed for sgd of 1167987528

Using random seed for sgd of 1180695064

Using random seed for sgd of 86321427

Loading a ParticleStack with 153707 items...

Loading a ParticleStack with 153707 items...

Loading a ParticleStack with 153707 items...

Using random seed for sgd of 1406879033

Loading a ParticleStack with 153707 items...

Using random seed for sgd of 1673007703

Loading a ParticleStack with 153707 items...

Loading a ParticleStack with 153707 items...

Using random seed for sgd of 580491404

Loading a ParticleStack with 153707 items...

Using random seed for sgd of 1466587809

Using random seed for sgd of 1648577409

Loading a ParticleStack with 153707 items...

Loading a ParticleStack with 153707 items...

Using random seed for sgd of 1629833240

Loading a ParticleStack with 153707 items...

Using random seed for sgd of 1362294214

Using random seed for sgd of 1654083681

Using random seed for sgd of 1715761950

Using random seed for sgd of 2020020785

Using random seed for sgd of 280703820

Using random seed for sgd of 2077377045

Using random seed for sgd of 814078100

Using random seed for sgd of 1142947647

Using random seed for sgd of 1542294467

Loading a ParticleStack with 153707 items...

Using random seed for sgd of 958433435

Loading a ParticleStack with 153707 items...

Loading a ParticleStack with 153707 items...

Loading a ParticleStack with 153707 items...

Loading a ParticleStack with 153707 items...

Loading a ParticleStack with 153707 items...

Loading a ParticleStack with 153707 items...

Loading a ParticleStack with 153707 items...

Loading a ParticleStack with 153707 items...

Loading a ParticleStack with 153707 items...

 SSD cache : cache successfuly synced in_use

 SSD cache : cache successfuly synced in_use

 SSD cache : cache successfuly synced in_use

 SSD cache : cache successfuly synced in_use

 SSD cache : cache successfuly synced in_use

 SSD cache : cache successfuly synced in_use

 SSD cache : cache successfuly synced in_use

 SSD cache : cache successfuly synced in_use

 SSD cache : cache successfuly synced in_use

 SSD cache : cache successfuly synced in_use

 SSD cache : cache successfuly synced in_use

 SSD cache : cache successfuly synced in_use

 SSD cache : cache successfuly synced in_use

 SSD cache : cache successfuly synced in_use

 SSD cache : cache successfuly synced in_use

 SSD cache : cache successfuly synced in_use

 SSD cache : cache successfuly synced in_use

 SSD cache : cache successfuly synced in_use

 SSD cache : cache successfuly synced in_use

 SSD cache : cache successfuly synced in_use

 SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

 SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

 SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

 SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

 SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

 SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

 SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

 SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

 SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

 SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

 SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

 SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

 SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

 SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

 SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

 SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

 SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

 SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

 SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

 SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

 SSD cache : cache successfuly requested to check 8975 files.

 SSD cache : cache waiting for requested files to become unlocked.

 SSD cache : cache waiting for requested files to become unlocked.

 SSD cache : cache waiting for requested files to become unlocked.

 SSD cache : cache waiting for requested files to become unlocked.

 SSD cache : cache waiting for requested files to become unlocked.

 SSD cache : cache waiting for requested files to become unlocked.

 SSD cache : cache waiting for requested files to become unlocked.

 SSD cache : cache waiting for requested files to become unlocked.

 SSD cache : cache waiting for requested files to become unlocked.

 SSD cache : cache waiting for requested files to become unlocked.

 SSD cache : cache waiting for requested files to become unlocked.

 SSD cache : cache waiting for requested files to become unlocked.

 SSD cache : cache waiting for requested files to become unlocked.

 SSD cache : cache waiting for requested files to become unlocked.

 SSD cache : cache waiting for requested files to become unlocked.

 SSD cache : cache waiting for requested files to become unlocked.

 SSD cache : cache waiting for requested files to become unlocked.

 SSD cache : cache waiting for requested files to become unlocked.

 SSD cache : cache waiting for requested files to become unlocked.

 SSD cache : cache requires 1112020.77MB more on the SSD for files to be downloaded.
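
To quantify the repetition in a log like the one above, identical lines can be tallied. The sketch below is only an illustration: `count_repeated_lines` is a hypothetical helper, not part of cryoSPARC, and the sample text is a short excerpt rather than the full log.

```python
from collections import Counter


def count_repeated_lines(log_text, min_count=2):
    """Return {line: count} for non-blank lines seen at least min_count times."""
    counts = Counter(
        line.strip() for line in log_text.splitlines() if line.strip()
    )
    return {line: n for line, n in counts.items() if n >= min_count}


# Short excerpt standing in for the real job log (assumption: the log is
# available as plain text, e.g. copied from the job's event log).
sample = """\
Project P1 Job J84 Started
Project P1 Job J84 Started
Master running v2.9.0, worker running v2.9.0
Job ready to run
Job ready to run
Job ready to run
"""

repeats = count_repeated_lines(sample)
for line, n in sorted(repeats.items(), key=lambda kv: -kv[1]):
    print(f"{n:3d}x  {line}")
```

If every message repeats the same number of times, that count may hint at how many copies of the job process were started.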

#2

In case it helps, here is the cluster_script.sh:

[marino_j@merlin-l-02 merlin6-big]$ vi cluster_script.sh
{%- macro _min(a, b) -%}
  {%- if a <= b %}{{a}}{% else %}{{b}}{% endif -%}
{%- endmacro -%}

# Available variables:
#  script_path_abs={{ script_path_abs }}
#      - the absolute path to the generated submission script
#  run_cmd={{ run_cmd }}
#      - the complete command-line string to run the job
#  num_cpu={{ num_cpu }}
#      - the number of CPUs needed
#  num_gpu={{ num_gpu }}
#      - the number of GPUs needed. Note: the code will use this many GPUs
#        starting from dev id 0. The cluster scheduler or this script have the
#        responsibility of setting CUDA_VISIBLE_DEVICES so that the job code
#        ends up using the correct cluster-allocated GPUs.
#  ram_gb={{ ram_gb }}
#      - the amount of RAM needed in GB
#  job_dir_abs={{ job_dir_abs }}
#      - absolute path to the job directory
#  project_dir_abs={{ project_dir_abs }}
#      - absolute path to the project dir
#  job_log_path_abs={{ job_log_path_abs }}
#      - absolute path to the log file for the job
#  worker_bin_path={{ worker_bin_path }}
#      - absolute path to the cryosparc worker command
#  run_args={{ run_args }}
#      - arguments to be passed to cryosparcw run
#  project_uid={{ project_uid }}
#      - uid of the project
#  job_uid={{ job_uid }}
#      - uid of the job
#  job_creator={{ job_creator }}
#      - name of the user that created the job (may contain spaces)
#  cryosparc_username={{ cryosparc_username }}
#      - cryosparc username of the user that created the job (usually an email)

#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --output={{ job_log_path_abs }}.out
#SBATCH --error={{ job_log_path_abs }}.err
#SBATCH --cpus-per-task=1
#SBATCH --threads-per-core=1
#SBATCH --partition=gpu
#SBATCH --exclusive
#SBATCH --mem=100000
#SBATCH --exclude=merlin-g-001

{%- if num_gpu == 0 %}
# Use CPU cluster
#SBATCH --constraint=mc
#SBATCH --ntasks={{ num_cpu }}

My jobs do finish, so this is not blocking me. I also see that they are labelled “failed” while they are still running, and they are marked “complete” once they finish.

Thanks a lot for your help!


#3

Hi @marino-j,

It looks like there isn’t enough space on the SSD to cache the particles for the ab-initio job: the last line of your log says the cache requires 1112020.77MB more. Try turning SSD caching off for this job:

[image: screenshot of the SSD caching setting]

- Suhail
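
As a rough check of the diagnosis above, the extra space requested in the log can be compared with the free space on the cache volume. This is a minimal sketch, not cryoSPARC code: `extra_mb_required` and `free_mb` are hypothetical helpers, the regular expression is matched only against the log line quoted in this thread, and `cache_path` is a placeholder that would need to point at the real SSD cache directory. Whether the log's "MB" means 10^6 or 2^20 bytes is an assumption here.

```python
import re
import shutil

# Pattern based on the log line quoted in this thread (assumed format).
REQUIRES_RE = re.compile(r"cache requires ([0-9.]+)MB more on the SSD")


def extra_mb_required(log_line):
    """Return the extra megabytes the cache asks for, or None if absent."""
    match = REQUIRES_RE.search(log_line)
    return float(match.group(1)) if match else None


def free_mb(path):
    """Free space on the filesystem holding `path`, in units of 10^6 bytes."""
    return shutil.disk_usage(path).free / 1e6


line = (
    " SSD cache : cache requires 1112020.77MB more on the SSD "
    "for files to be downloaded."
)
needed = extra_mb_required(line)

# Placeholder: replace with the worker's actual SSD cache directory.
cache_path = "."
available = free_mb(cache_path)

print(f"Cache needs {needed:.2f} MB more, roughly {needed / 1e6:.2f} TB")
print(f"Free space at {cache_path!r}: {available:.0f} MB")
if needed > available:
    print("Not enough room: disable SSD caching or free up space.")
```

With roughly 1.1 TB requested, most node-local SSDs would be too small, which is consistent with the suggestion to switch caching off.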