Rebalance 2D classes failure to launch

Hi everyone!

I’m relatively new to cryoSPARC and I’m trying to bin my 2D classes with “Rebalance 2D classes”. I’ve manually picked particles and generated 2D class averages; the 50 classes seem to fall into 3 “good” bins that look like what I want and one or two “junk” bins.

So, I input my particles and 2D classes to the Rebalance job, tell it I want 5 bins, and then hit “Queue”. It fails immediately, but the job remains in the builder until I queue it again, at which point it fails because the name is redundant (I think; error message and other relevant information below). I’ve been running jobs all day and everything else has run without issue. Does anyone have any ideas about what could be going on?

Note: I’m using the online cryoSPARC v2 and ssh’ing into my data and the server for data processing.

Additional note: I’ve removed names and identifying information. If you see the word “removed” in the code below, this is why.

Initial error:

License is valid.

Launching job on lane removed target removed ...

Launching job on cluster removed


====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=0
#SBATCH --gres=gpu:0
#SBATCH --mem=0MB
#SBATCH --time=36:00:00
#SBATCH --exclusive
#SBATCH --job-name P120J236
#SBATCH --output=/data/work/name_removed/cryosparc/P120/J236output.txt
#SBATCH --error=/data/work/name_removed/cryosparc/P120/J236error.txt

available_devs=""
for devidx in $(seq 0 15);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs

/net/removed/data/software/cryoSPARC/V2/cryosparc2_worker/bin/cryosparcw run --project P120 --job J236 --master_hostname removed --master_command_core_port 39002 > /data/work/name_removed/cryosparc/P120/J236/job.log 2>&1 
==========================================================================
==========================================================================

-------- Submission command: 
sbatch /data/work/name_removed/cryosparc/P120/J236/queue_sub_script.sh
Failed to launch! 1

When I get this message, the job is still purple but shows a red dot in the tiled view of my workspace. If I queue it again:

License is valid.

Launching job on lane removed target removed ...

Job directory /data/work/name_removed/cryosparc/P120/J236 is not empty, found: /data/work/name_removed/cryosparc/P120/J236/queue_sub_script.sh

Again, the second error makes sense: it’s trying to put two files with the same name in the same place. But why does the first run of the job fail, and why does it only fail halfway?

Thanks so much for your expertise!
Kate
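
The submission script in the post above requests `--ntasks=0` and `--mem=0MB`, and a later reply in this thread reports that memory and CPU tasks were not set. As a minimal sketch (the function name `find_zero_requests` is hypothetical, and it only covers the long `--option=value` form used in that script), the following lists such zero-valued directives so they can be spotted before resubmitting:

```shell
#!/usr/bin/env bash
# Hypothetical helper: list "#SBATCH --ntasks=0" / "#SBATCH --mem=0MB" style
# lines in a submission script. Prints matches with their line numbers and
# returns 0 if at least one was found, 1 otherwise (grep's exit status).
# Assumption: "--gres=gpu:0" is not flagged, since a job may need no GPU.
find_zero_requests() {
    local script=$1
    grep -nE '^#SBATCH +--(ntasks|mem)=0([A-Za-z]*)$' "$script"
}
```

For example, `find_zero_requests queue_sub_script.sh` run inside the job directory prints the offending lines, if any.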

Hi @kmradford,

Thanks for reporting this. To help us debug, would you be able to copy and paste the job.log file from a new Rebalance job, right after it fails for the first time? This is available in the job directory, and can be viewed by using less job.log

Best,
Michael
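
As a small sketch of the log check requested above, the helper below prints the tail of `job.log` from a job directory, or reports that the file is absent (which can happen when the job never started on the cluster). The function name `show_job_log` and the default of 50 lines are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show the last lines of job.log in a job directory.
# Usage: show_job_log <job_dir> [num_lines]
show_job_log() {
    local job_dir=$1 lines=${2:-50}
    local log="${job_dir}/job.log"
    if [[ -s "$log" ]]; then
        # The log exists and is non-empty: print its tail.
        tail -n "$lines" "$log"
    else
        echo "job.log is missing or empty in ${job_dir}" >&2
        return 1
    fi
}
```

With the paths from the first post, this would be called as `show_job_log /data/work/name_removed/cryosparc/P120/J236`.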

Hi @mmclean, I am also having issues starting a “rebalance 2D” job. I am using cryoSPARC v3.0.0, and when I attempt to queue the job, the message “float division by zero” briefly pops up at the top of the page. The job overview output lists:

License is valid.

Launching job on lane XXX target XXX

Launching job on cluster XXX

and the job status is shown as “launched”. The job directory is created but contains only the files events.bson and job.json and a directory called “gridfs_data” that appears to be empty.

The error occurs whether or not the “do superclassification” option is selected (and when I select “do superclassification”, the number of superclasses is less than the number of templates). Any idea where this float error could be coming from?

Hi @redler,

Thanks for reporting – could I ask you to copy and paste the job metadata here as well? This can be found by clicking on the job card, pressing the space bar, and navigating to the “Metadata” tab.

Best,
Michael

Thanks for your help @mmclean! Metadata is below:

    {
      "_id": {
    "_str": "5ff3889a844b30e4d065c968"
      },
      "children": [],
      "created_at": "2021-01-04T21:28:58.223Z",
      "created_by_job_uid": null,
      "created_by_user_id": "5bb7f3b0a203c734a3a2e487",
      "interactive": false,
      "job_type": "rebalance_classes_2D",
      "launched_at": "2021-01-04T21:29:39.977Z",
      "parents": [
    "J69"
      ],
      "priority": 0,
      "project_uid": "P151",
      "queue_message": null,
      "queued_at": "2021-01-04T21:29:39.726Z",
      "started_at": null,
      "status": "launched",
      "title": "New Job J84",
      "uid": "J84",
      "waiting_at": null,
      "workspace_uids": [
    "W3"
      ],
      "cloned_from": null,
      "deleted": false,
      "type": "rebalance_classes_2D",
      "ui_tile_height": 1,
      "ui_tile_images": [],
      "ui_tile_width": 2,
      "completed_at": null,
      "description": "Enter a description.",
      "failed_at": null,
      "job_dir_size": 0,
      "killed_at": null,
      "last_accessed": {
    "name": "redler01",
    "accessed_at": "2021-01-05T17:52:44.003Z"
      },
      "version": "v3.0.0",
      "run_as_user": null,
      "params_secs": {
    "rebalancing_params": {
      "title": "Rebalancing Parameters",
      "desc": "Parameters controlling the resampling of particles across superclasses.",
      "order": 0
    },
    "general_settings": {
      "title": "General Settings",
      "desc": "",
      "order": 1
    },
    "random": {
      "title": "Random Seeds",
      "desc": "",
      "order": 2
    }
      },
      "params_base": {
    "rebalance_factor": {
      "type": "number",
      "value": 0.1,
      "title": "Rebalance factor",
      "desc": "Factor by which the superclasses are rebalanced. Must be between 0 and 1. Set this to 0 for no rebalancing (all particles kept), or to 1 for uniform rebalancing (all superclasses have the same size). If nonzero, this corresponds to a lower bound on the ratio between the number of particles in the smallest superclass, and the number of particles in the largest superclass.",
      "order": 0,
      "section": "rebalancing_params",
      "advanced": false,
      "hidden": false
    },
    "num_superclasses": {
      "type": "number",
      "value": null,
      "title": "Number of superclasses or templates (integer)",
      "desc": "Corresponds to the approximate number of unique views that are present in the set of templates passed. If \"Do superclassification\" is false, this must be exactly equal to the number of templates passed. If \"Do superclassification\" is true, this must be an integer strictly less than the number of templates passed. Running multiple jobs with different numbers of superclasses may help to find the best clustering.",
      "order": 1,
      "section": "rebalancing_params",
      "advanced": false,
      "hidden": false
    },
    "kernel_width": {
      "type": "number",
      "value": null,
      "title": "RBF kernel width",
      "desc": "Width of the RBF kernel used in spectral clustering.",
      "order": 2,
      "section": "rebalancing_params",
      "advanced": true,
      "hidden": true
    },
    "split_outputs": {
      "type": "boolean",
      "value": false,
      "title": "Split outputs",
      "desc": "Whether the outputs (templates and particles) should be split by superclass/template, or should be merged together.",
      "order": 3,
      "section": "general_settings",
      "advanced": false,
      "hidden": false
    },
    "do_superclassification": {
      "type": "boolean",
      "value": true,
      "title": "Do superclassification",
      "desc": "Whether rebalancing should be based on superclasses, or based directly on the templates passed. If true, will use spectral clustering to resample particles across views (i.e. superclasses). If false, will simply resample particles across the templates passed.",
      "order": 4,
      "section": "general_settings",
      "advanced": false,
      "hidden": false
    },
    "downsampling_factor": {
      "type": "number",
      "value": 2,
      "title": "Downsampling factor (integer)",
      "desc": "Factor to downscale/downsample template images by when maximizing over pose. The image template size must be divisible by this factor. Larger values will result in faster but less accurate affinity/similarity matrix. Common values are 1, 2, or 4.",
      "order": 5,
      "section": "general_settings",
      "advanced": true,
      "hidden": false
    },
    "lowpass_res": {
      "type": "number",
      "value": 15,
      "title": "Lowpass filter corner resolution (Angstroms)",
      "desc": "Corner resolution for lowpass filtering images by prior to clustering.",
      "order": 6,
      "section": "general_settings",
      "advanced": false,
      "hidden": false
    },
    "angle_step": {
      "type": "number",
      "value": 3,
      "title": "Angle step size (degrees)",
      "desc": "Step size (in degrees) to take when computing rotations (for maximizing over pose). Smaller values increase accuracy, at the expense of runtime.",
      "order": 7,
      "section": "general_settings",
      "advanced": true,
      "hidden": false
    },
    "transpose_templates": {
      "type": "boolean",
      "value": false,
      "title": "Transpose templates",
      "desc": "",
      "order": 8,
      "section": "general_settings",
      "advanced": true,
      "hidden": true
    },
    "shift_bound": {
      "type": "number",
      "value": 40,
      "title": "Shift bound (pixels)",
      "desc": "Initial pixel shift bound (deviation from center) when computing horizontal and vertical shifts (for maximizing over pose).",
      "order": 9,
      "section": "general_settings",
      "advanced": true,
      "hidden": true
    },
    "use_bfgs": {
      "type": "boolean",
      "value": true,
      "title": "Use L-BFGS-B optimization",
      "desc": "Use L-BFGS-B algorithm to refine the optimization when computing affinity/similarity matrix.",
      "order": 10,
      "section": "general_settings",
      "advanced": true,
      "hidden": true
    },
    "random_seed": {
      "type": "number",
      "value": null,
      "title": "Random seed",
      "desc": "Set to None to auto generate.",
      "order": 11,
      "section": "random",
      "advanced": false,
      "hidden": false
    }
      },
      "params_spec": {
    "num_superclasses": {
      "value": 20
    }
      },
      "input_slot_groups": [
    {
      "type": "particle",
      "name": "particles",
      "title": "Particles",
      "description": "Particles.",
      "count_min": 1,
      "count_max": 1,
      "repeat_allowed": false,
      "slots": [
        {
          "type": "particle.blob",
          "name": "blob",
          "title": "Particle raw data",
          "description": "",
          "optional": false
        },
        {
          "type": "particle.alignments2D",
          "name": "alignments2D",
          "title": "Particle 2D alignments",
          "description": "",
          "optional": false
        }
      ],
      "connections": [
        {
          "job_uid": "J69",
          "group_name": "particles",
          "slots": [
            {
              "slot_name": "blob",
              "job_uid": "J69",
              "group_name": "particles",
              "result_name": "blob",
              "result_type": "particle.blob",
              "version": "F"
            },
            {
              "slot_name": "alignments2D",
              "job_uid": "J69",
              "group_name": "particles",
              "result_name": "alignments2D",
              "result_type": "particle.alignments2D",
              "version": "F"
            },
            {
              "slot_name": null,
              "job_uid": "J69",
              "group_name": "particles",
              "result_name": "ctf",
              "result_type": "particle.ctf",
              "version": "F"
            },
            {
              "slot_name": null,
              "job_uid": "J69",
              "group_name": "particles",
              "result_name": "location",
              "result_type": "particle.location",
              "version": "F"
            },
            {
              "slot_name": null,
              "job_uid": "J69",
              "group_name": "particles",
              "result_name": "pick_stats",
              "result_type": "particle.pick_stats",
              "version": "F"
            }
          ]
        }
      ]
    },
    {
      "type": "template",
      "name": "templates",
      "title": "2D Class Averages",
      "description": "Class averages (typically output from select_2D job).",
      "count_min": 1,
      "count_max": 1,
      "repeat_allowed": false,
      "slots": [
        {
          "type": "template.blob",
          "name": "blob",
          "title": "Template raw data",
          "description": "",
          "optional": false
        }
      ],
      "connections": [
        {
          "job_uid": "J69",
          "group_name": "class_averages",
          "slots": [
            {
              "slot_name": "blob",
              "job_uid": "J69",
              "group_name": "class_averages",
              "result_name": "blob",
              "result_type": "template.blob",
              "version": "F"
            }
          ]
        }
      ]
    }
      ],
      "output_result_groups": [
    {
      "uid": "J84-G0",
      "type": "particle",
      "name": "particles_selected",
      "title": "Particles selected",
      "description": "",
      "contains": [
        {
          "uid": "J84-R0",
          "type": "particle.blob",
          "group_name": "particles_selected",
          "name": "blob",
          "passthrough": false
        },
        {
          "uid": "J84-R1",
          "type": "particle.alignments2D",
          "group_name": "particles_selected",
          "name": "alignments2D",
          "passthrough": false
        },
        {
          "uid": "J84-R2",
          "type": "particle.ctf",
          "group_name": "particles_selected",
          "name": "ctf",
          "passthrough": true
        },
        {
          "uid": "J84-R3",
          "type": "particle.location",
          "group_name": "particles_selected",
          "name": "location",
          "passthrough": true
        },
        {
          "uid": "J84-R4",
          "type": "particle.pick_stats",
          "group_name": "particles_selected",
          "name": "pick_stats",
          "passthrough": true
        }
      ],
      "passthrough": "particles",
      "num_items": 0
    },
    {
      "uid": "J84-G1",
      "type": "template",
      "name": "templates_all",
      "title": "Templates",
      "description": "",
      "contains": [
        {
          "uid": "J84-R5",
          "type": "template.blob",
          "group_name": "templates_all",
          "name": "blob",
          "passthrough": false
        }
      ],
      "passthrough": false,
      "num_items": 0
    },
    {
      "uid": "J84-G2",
      "type": "particle",
      "name": "particles_excluded",
      "title": "Particles excluded",
      "description": "",
      "contains": [
        {
          "uid": "J84-R6",
          "type": "particle.blob",
          "group_name": "particles_excluded",
          "name": "blob",
          "passthrough": false
        },
        {
          "uid": "J84-R7",
          "type": "particle.alignments2D",
          "group_name": "particles_excluded",
          "name": "alignments2D",
          "passthrough": false
        },
        {
          "uid": "J84-R8",
          "type": "particle.ctf",
          "group_name": "particles_excluded",
          "name": "ctf",
          "passthrough": true
        },
        {
          "uid": "J84-R9",
          "type": "particle.location",
          "group_name": "particles_excluded",
          "name": "location",
          "passthrough": true
        },
        {
          "uid": "J84-R10",
          "type": "particle.pick_stats",
          "group_name": "particles_excluded",
          "name": "pick_stats",
          "passthrough": true
        }
      ],
      "passthrough": "particles",
      "num_items": 0
    }
      ],
      "output_results": [
    {
      "uid": "J84-R0",
      "type": "particle.blob",
      "group_name": "particles_selected",
      "name": "blob",
      "title": "Particle data",
      "description": "",
      "min_fields": [
        [
          "path",
          "O"
        ],
        [
          "idx",
          "u4"
        ],
        [
          "shape",
          "2u4"
        ],
        [
          "psize_A",
          "f4"
        ],
        [
          "sign",
          "f4"
        ]
      ],
      "versions": [],
      "metafiles": [],
      "num_items": [],
      "passthrough": false
    },
    {
      "uid": "J84-R1",
      "type": "particle.alignments2D",
      "group_name": "particles_selected",
      "name": "alignments2D",
      "title": "Particle 2D alignments",
      "description": "",
      "min_fields": [
        [
          "split",
          "u4"
        ],
        [
          "shift",
          "2f4"
        ],
        [
          "pose",
          "f4"
        ],
        [
          "psize_A",
          "f4"
        ],
        [
          "error",
          "f4"
        ],
        [
          "error_min",
          "f4"
        ],
        [
          "resid_pow",
          "f4"
        ],
        [
          "slice_pow",
          "f4"
        ],
        [
          "image_pow",
          "f4"
        ],
        [
          "cross_cor",
          "f4"
        ],
        [
          "alpha",
          "f4"
        ],
        [
          "alpha_min",
          "f4"
        ],
        [
          "weight",
          "f4"
        ],
        [
          "pose_ess",
          "f4"
        ],
        [
          "shift_ess",
          "f4"
        ],
        [
          "class_posterior",
          "f4"
        ],
        [
          "class",
          "u4"
        ],
        [
          "class_ess",
          "f4"
        ]
      ],
      "versions": [],
      "metafiles": [],
      "num_items": [],
      "passthrough": false
    },
    {
      "uid": "J84-R2",
      "type": "particle.ctf",
      "group_name": "particles_selected",
      "name": "ctf",
      "title": "Passthrough ctf",
      "description": "Passthrough from input particles.ctf (result_name)",
      "min_fields": [
        [
          "type",
          "O"
        ],
        [
          "exp_group_id",
          "u4"
        ],
        [
          "accel_kv",
          "f4"
        ],
        [
          "cs_mm",
          "f4"
        ],
        [
          "amp_contrast",
          "f4"
        ],
        [
          "df1_A",
          "f4"
        ],
        [
          "df2_A",
          "f4"
        ],
        [
          "df_angle_rad",
          "f4"
        ],
        [
          "phase_shift_rad",
          "f4"
        ],
        [
          "scale",
          "f4"
        ],
        [
          "scale_const",
          "f4"
        ],
        [
          "shift_A",
          "2f4"
        ],
        [
          "tilt_A",
          "2f4"
        ],
        [
          "trefoil_A",
          "2f4"
        ],
        [
          "tetra_A",
          "4f4"
        ],
        [
          "anisomag",
          "4f4"
        ],
        [
          "bfactor",
          "f4"
        ]
      ],
      "versions": [],
      "metafiles": [],
      "num_items": [],
      "passthrough": true
    },
    {
      "uid": "J84-R3",
      "type": "particle.location",
      "group_name": "particles_selected",
      "name": "location",
      "title": "Passthrough location",
      "description": "Passthrough from input particles.location (result_name)",
      "min_fields": [
        [
          "micrograph_uid",
          "u8"
        ],
        [
          "exp_group_id",
          "u4"
        ],
        [
          "micrograph_path",
          "O"
        ],
        [
          "micrograph_shape",
          "2u4"
        ],
        [
          "center_x_frac",
          "f4"
        ],
        [
          "center_y_frac",
          "f4"
        ]
      ],
      "versions": [],
      "metafiles": [],
      "num_items": [],
      "passthrough": true
    },
    {
      "uid": "J84-R4",
      "type": "particle.pick_stats",
      "group_name": "particles_selected",
      "name": "pick_stats",
      "title": "Passthrough pick_stats",
      "description": "Passthrough from input particles.pick_stats (result_name)",
      "min_fields": [
        [
          "ncc_score",
          "f4"
        ],
        [
          "power",
          "f4"
        ],
        [
          "template_idx",
          "u4"
        ],
        [
          "angle_rad",
          "f4"
        ]
      ],
      "versions": [],
      "metafiles": [],
      "num_items": [],
      "passthrough": true
    },
    {
      "uid": "J84-R5",
      "type": "template.blob",
      "group_name": "templates_all",
      "name": "blob",
      "title": "Template data",
      "description": "",
      "min_fields": [
        [
          "path",
          "O"
        ],
        [
          "idx",
          "u4"
        ],
        [
          "shape",
          "2u4"
        ],
        [
          "psize_A",
          "f4"
        ],
        [
          "res_A",
          "f4"
        ]
      ],
      "versions": [],
      "metafiles": [],
      "num_items": [],
      "passthrough": false
    },
    {
      "uid": "J84-R6",
      "type": "particle.blob",
      "group_name": "particles_excluded",
      "name": "blob",
      "title": "Particle data",
      "description": "",
      "min_fields": [
        [
          "path",
          "O"
        ],
        [
          "idx",
          "u4"
        ],
        [
          "shape",
          "2u4"
        ],
        [
          "psize_A",
          "f4"
        ],
        [
          "sign",
          "f4"
        ]
      ],
      "versions": [],
      "metafiles": [],
      "num_items": [],
      "passthrough": false
    },
    {
      "uid": "J84-R7",
      "type": "particle.alignments2D",
      "group_name": "particles_excluded",
      "name": "alignments2D",
      "title": "Particle 2D alignments",
      "description": "",
      "min_fields": [
        [
          "split",
          "u4"
        ],
        [
          "shift",
          "2f4"
        ],
        [
          "pose",
          "f4"
        ],
        [
          "psize_A",
          "f4"
        ],
        [
          "error",
          "f4"
        ],
        [
          "error_min",
          "f4"
        ],
        [
          "resid_pow",
          "f4"
        ],
        [
          "slice_pow",
          "f4"
        ],
        [
          "image_pow",
          "f4"
        ],
        [
          "cross_cor",
          "f4"
        ],
        [
          "alpha",
          "f4"
        ],
        [
          "alpha_min",
          "f4"
        ],
        [
          "weight",
          "f4"
        ],
        [
          "pose_ess",
          "f4"
        ],
        [
          "shift_ess",
          "f4"
        ],
        [
          "class_posterior",
          "f4"
        ],
        [
          "class",
          "u4"
        ],
        [
          "class_ess",
          "f4"
        ]
      ],
      "versions": [],
      "metafiles": [],
      "num_items": [],
      "passthrough": false
    },
    {
      "uid": "J84-R8",
      "type": "particle.ctf",
      "group_name": "particles_excluded",
      "name": "ctf",
      "title": "Passthrough ctf",
      "description": "Passthrough from input particles.ctf (result_name)",
      "min_fields": [
        [
          "type",
          "O"
        ],
        [
          "exp_group_id",
          "u4"
        ],
        [
          "accel_kv",
          "f4"
        ],
        [
          "cs_mm",
          "f4"
        ],
        [
          "amp_contrast",
          "f4"
        ],
        [
          "df1_A",
          "f4"
        ],
        [
          "df2_A",
          "f4"
        ],
        [
          "df_angle_rad",
          "f4"
        ],
        [
          "phase_shift_rad",
          "f4"
        ],
        [
          "scale",
          "f4"
        ],
        [
          "scale_const",
          "f4"
        ],
        [
          "shift_A",
          "2f4"
        ],
        [
          "tilt_A",
          "2f4"
        ],
        [
          "trefoil_A",
          "2f4"
        ],
        [
          "tetra_A",
          "4f4"
        ],
        [
          "anisomag",
          "4f4"
        ],
        [
          "bfactor",
          "f4"
        ]
      ],
      "versions": [],
      "metafiles": [],
      "num_items": [],
      "passthrough": true
    },
    {
      "uid": "J84-R9",
      "type": "particle.location",
      "group_name": "particles_excluded",
      "name": "location",
      "title": "Passthrough location",
      "description": "Passthrough from input particles.location (result_name)",
      "min_fields": [
        [
          "micrograph_uid",
          "u8"
        ],
        [
          "exp_group_id",
          "u4"
        ],
        [
          "micrograph_path",
          "O"
        ],
        [
          "micrograph_shape",
          "2u4"
        ],
        [
          "center_x_frac",
          "f4"
        ],
        [
          "center_y_frac",
          "f4"
        ]
      ],
      "versions": [],
      "metafiles": [],
      "num_items": [],
      "passthrough": true
    },
    {
      "uid": "J84-R10",
      "type": "particle.pick_stats",
      "group_name": "particles_excluded",
      "name": "pick_stats",
      "title": "Passthrough pick_stats",
      "description": "Passthrough from input particles.pick_stats (result_name)",
      "min_fields": [
        [
          "ncc_score",
          "f4"
        ],
        [
          "power",
          "f4"
        ],
        [
          "template_idx",
          "u4"
        ],
        [
          "angle_rad",
          "f4"
        ]
      ],
      "versions": [],
      "metafiles": [],
      "num_items": [],
      "passthrough": true
    }
      ],
      "output_group_images": {},
      "errors_build_params": {},
      "errors_build_inputs": {},
      "errors_run": [],
      "running_at": null,
      "token_acquired_at": null,
      "tokens_requested_at": null,
      "last_scheduled_at": null,
      "resources_needed": {},
      "resources_allocated": {
    "lane": "BP_gpu4_long",
    "lane_type": "BP_gpu4_long",
    "hostname": "BP_gpu4_long",
    "target": {
      "lane": "BP_gpu4_long",
      "qdel_cmd_tpl": "scancel {{ cluster_job_id }}",
      "worker_bin_path": "/gpfs/data/bhabhaekiertlabs/local_software/CryoSparc/cryosparc2_worker/bin/cryosparcw",
      "title": "BP_gpu4_long",
      "hostname": "BP_gpu4_long",
      "qstat_cmd_tpl": "squeue -j {{ cluster_job_id }}",
      "qinfo_cmd_tpl": "sinfo",
      "qsub_cmd_tpl": "sbatch {{ script_path_abs }}",
      "cache_path": "/gpfs/scratch/svc_bhabhaekiertlabs",
      "cache_quota_mb": null,
      "script_tpl": "#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. \n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -N 1\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH -p gpu4_long\n## #SBATCH --mem={{ (ram_gb*1000)|int }}MB             \n#SBATCH --mem-per-cpu={{ (ram_gb*2000/num_cpu)|int }}MB\n#SBATCH -o {{ job_dir_abs }}/cryosparc_{{ project_uid }}_{{ job_uid }}.out\n#SBATCH -e {{ job_dir_abs }}/cryosparc_{{ project_uid }}_{{ job_uid }}.err\n\navailable_devs=\"\"\nfor devidx in $(seq 0 15);\ndo\n    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n        if [[ -z \"$available_devs\" ]] ; then\n            available_devs=$devidx\n        else\n            available_devs=$available_devs,$devidx\n        fi\n    fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n\n",
      "desc": null,
      "cache_reserve_mb": 80000,
      "type": "cluster",
      "send_cmd_tpl": "{{ command }}",
      "name": "BP_gpu4_long"
    },
    "slots": {},
    "fixed": {},
    "license": false,
    "licenses_acquired": 0
      },
      "run_on_master_direct": false,
      "queued_to_lane": "BP_gpu4_long",
      "queue_index": null,
      "queue_status": null,
      "queued_job_hash": null,
      "interactive_hostname": "BP_gpu4_long",
      "interactive_port": null,
      "PID_monitor": null,
      "PID_main": null,
      "PID_workers": [],
      "cluster_job_id": null,
      "is_experiment": false,
      "job_dir": "J84",
      "experiment_worker_path": null,
      "enable_bench": false,
      "bench": {},
      "project_uid_num": 151,
      "uid_num": 84,
      "ui_layouts": {
    "P151": {
      "show": true,
      "floater": false,
      "top": 4864,
      "left": 3276,
      "width": 298,
      "height": 192,
      "groups": []
    },
    "P151W3": {
      "show": true,
      "floater": false,
      "top": 4624,
      "left": 1844,
      "width": 298,
      "height": 192,
      "groups": []
    }
      },
      "last_exported": "2021-01-04T21:29:30.895Z",
      "queued_to_hostname": false,
      "queued_to_gpu": false,
      "no_check_inputs_ready": false,
      "num_tokens": 0,
      "job_sig": "3958326760272536805083087768582760114198856296476229231937835652505291382478773768455248373246139660490556284937580369924277352127122213869722981413752377673094671136165219840924534704215735635942407298820870695194900757793482790466124504926289296248903671760472275700334652632002472055064942395246825981711164141815388582962549719137852204939412937223682761236664748123167224756713769404916225615277149874933557962713702726538593795642026576046151737449622151533037658578688119625866277226345348921694799822195063415453408047396682715484874502647913396353339966147988633106264280771037266108507169913290170246273599",
      "tokens_acquired_at": 1609795779.9747014,
      "status_num": 15
    }
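
A possible explanation, not confirmed anywhere in this thread: the `script_tpl` in the metadata above computes `--mem-per-cpu` as `(ram_gb*2000/num_cpu)|int`. If the job requests zero CPUs (the `--ntasks=0` line in the first post's script, from a different cluster, suggests Rebalance 2D jobs may have done so), that expression divides by zero, which would be consistent with the “float division by zero” message. The sketch below mirrors the arithmetic in bash with a guard; the function name `mem_per_cpu_mb` is hypothetical and integer inputs are assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the template expression
# (ram_gb*2000/num_cpu)|int, but refusing to divide when num_cpu is zero.
# Assumption: ram_gb and num_cpu are whole numbers (bash has no floats).
mem_per_cpu_mb() {
    local ram_gb=$1 num_cpu=$2
    if (( num_cpu <= 0 )); then
        echo "num_cpu is ${num_cpu}; cannot compute --mem-per-cpu" >&2
        return 1
    fi
    # Integer division truncates, analogous to the template's |int filter.
    echo $(( ram_gb * 2000 / num_cpu ))
}
```

With such a guard, a zero-CPU request produces a readable message instead of an arithmetic error while the script is being rendered.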

Hi @redler,

If the job directory for the problematic job still exists (and the job hasn’t been cleared), could you try running the submission command manually? This can be done by ssh’ing into the worker node, navigating to the job directory, then typing the submission command:

sbatch ./queue_sub_script.sh

To find the job directory, first find the project directory: navigate to the project main page, click the “Details” button, and scroll down to the directory field (for example, see the screenshot below displaying the directory for Project P107). The job directory is a subdirectory of the project directory, named with the job’s UID (in this example, if the problematic Rebalance 2D job were J10, it would be located at /u/cryosparcdev/cryosparc2_projects/P107/J10/).

[Screenshot: project details showing the directory field for Project P107]

If you’ve cleared or deleted the job since it last failed to launch, would you be able to instead try launching a new job and repeating the above if it still doesn’t launch? Running the cluster submission script manually should produce a bit more detailed output that hopefully can show us where the error is occurring.

Best,
Michael
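
The manual resubmission described above can be sketched as follows. The helper first checks that `queue_sub_script.sh` actually exists in the job directory, since the script is not always generated when a launch fails. The function name `find_sub_script` is hypothetical; the project directory and job UID are whatever applies to your instance.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the path of a job's submission script,
# or fail with a message if cryoSPARC never wrote one.
# Usage: find_sub_script <project_dir> <job_uid>
find_sub_script() {
    local project_dir=$1 job_uid=$2
    local script="${project_dir}/${job_uid}/queue_sub_script.sh"
    if [[ -f "$script" ]]; then
        echo "$script"
    else
        echo "no submission script in ${project_dir}/${job_uid}" >&2
        return 1
    fi
}
```

If the script is found, it can then be submitted from its own directory, for example `cd /u/cryosparcdev/cryosparc2_projects/P107/J10 && sbatch ./queue_sub_script.sh`.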

Hi @mmclean,

Thanks! My job directory does not contain the submission script file; the only contents are two files (events.bson and job.json) and a directory (gridfs_data). The job had not been cleared, and I initiated another identical job just to be sure, but the submission script is not generated there either.

Best,
Rachel

Hello @redler,

Apologies for the delay in response, and thank you very much for providing this information. Based on this, I believe this is a bug, and we plan to fix it in our next release, which we aim to deploy this month.

Best,
Michael

Thanks for looking into this @mmclean! I’ll try again after we update to the next release.

Best,
Rachel

I see the same problem when submitting “Rebalance 2D Classes” jobs to SLURM.
Memory and number of CPU tasks are not set.
Manually editing these in the submission script and resubmitting with sbatch from a terminal makes it run.
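
The manual edit described above might look like the sketch below, which rewrites zero-valued `--ntasks` and `--mem` directives in a submission script before it is resubmitted. The function name `patch_zero_resources` is hypothetical, the replacement values are placeholders to be chosen for your cluster, and only the long `--option=value` form shown in the first post is handled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace "#SBATCH --ntasks=0" and "#SBATCH --mem=0MB"
# with caller-supplied values. Other lines are left untouched.
# Usage: patch_zero_resources <script> <ntasks> <mem_mb>
patch_zero_resources() {
    local script=$1 ntasks=$2 mem_mb=$3
    local tmp
    tmp=$(mktemp) || return 1
    # Write the patched text to a temp file, then copy it back over the
    # original so the script keeps its existing permissions.
    sed -e "s/^#SBATCH --ntasks=0\$/#SBATCH --ntasks=${ntasks}/" \
        -e "s/^#SBATCH --mem=0MB\$/#SBATCH --mem=${mem_mb}MB/" \
        "$script" > "$tmp" && cat "$tmp" > "$script"
    local status=$?
    rm -f "$tmp"
    return "$status"
}
```

After patching, for example with `patch_zero_resources queue_sub_script.sh 4 16000`, the script can be resubmitted with `sbatch queue_sub_script.sh`.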

Hi all,

Thank you for the responses and info. We have released a v3.1.0 update to cryoSPARC, with a fix that should resolve this issue. Please let us know if you are still encountering this error in v3.1.0.

Best,
Michael
