"Job is unresponsive - no heartbeat received" error

Whenever I run Refinement or 3D Classification/Variability jobs, they fail after 8-10 hours.

[2023-05-07 14:16:52.26]

Job is unresponsive - no heartbeat received in 60 seconds.

I read some of the other posts on this topic before making my own, but the error still persists in my CryoSPARC instance.

I am using CryoSPARC v4.1.2; I had this problem even before updating to this version.

Only about 1 in 10-15 jobs completes successfully; the others fail. The jobs that fail with this error are usually Refinement and 3D Classification jobs.

I asked our IT specialist whether the error originates on our side. He mentioned that "the job is shown as complete from their side, and that's the reason why CryoSPARC loses connection with the cluster."
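(For context, this is how one can ask SLURM for its own view of the job; sacct and squeue are standard SLURM tools, and <slurm_job_id> is a placeholder for the ID CryoSPARC prints when it queues the job:)

# Ask SLURM what happened to a finished or failed job:
sacct -j <slurm_job_id> --format=JobID,JobName,State,Elapsed,ExitCode,MaxRSS
# While the job is still queued or running:
squeue -j <slurm_job_id>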

What could be a possible solution to this?

Please let me know if there are any other details I should add; I will edit the post to include them.

Thank you everyone!

PS: Adding the log as requested by @wtempel.

================= CRYOSPARCW =======  2023-04-29 00:06:06.739307  =========
Project P38 Job J260
Master cryoemlic.uni-muenster.de Port 39002
===========================================================================
========= monitor process now starting main process at 2023-04-29 00:06:06.739452
MAINPROCESS PID 13178
MAIN PID 13178
refine.newrun cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process
========= sending heartbeat at 2023-04-29 00:06:41.880891
[... similar heartbeat lines removed here to keep the log under the forum's character limit ...]
========= sending heartbeat at 2023-04-29 00:38:46.093365
***************************************************************
Running job  J260  of type  homo_refine_new
Running job on hostname %s gpua100:palma.uni-muenster.de
Allocated Resources :  {'fixed': {'SSD': False}, 'hostname': 'gpua100:palma.uni-muenster.de', 'lane': 'gpua100:palma.uni-muenster.de', 'lane_type': 'cluster', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3], 'GPU': [0], 'RAM': [0, 1, 2]}, 'target': {'cache_path': '/scratch/tmp/cryospar/cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['num_gpu', 'num_cpu', 'ram_gb'], 'desc': None, 'hostname': 'gpua100:palma.uni-muenster.de', 'lane': 'gpua100:palma.uni-muenster.de', 'name': 'gpua100:palma.uni-muenster.de', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. \n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ job_creator }}_{{ project_uid }}_{{ job_uid }}\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH -p gpua100\n#SBATCH --mem={{ (num_gpu*60)|int }}G\n#SBATCH -o {{ job_dir_abs }}/{{ project_uid }}_{{ job_uid }}.out\n#SBATCH -e {{ job_dir_abs }}/{{ project_uid }}_{{ job_uid }}.err\nml palma/2021a\nml CUDA/11.6.0\n\navailable_devs=""\nfor devidx in $(seq 0 15);\ndo\n    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n        if [[ -z "$available_devs" ]] ; then\n            available_devs=$devidx\n        else\n            available_devs=$available_devs,$devidx\n        fi\n    fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': 'ssh -i ~/.ssh/id_rsa_palma_cryospar_VM cryospar@palma.uni-muenster.de {{ command }}', 'title': 'gpua100:palma.uni-muenster.de', 'tpl_vars': ['job_dir_abs', 'job_uid', 'cryosparc_username', 'cluster_job_id', 'project_dir_abs', 'job_creator', 'command', 'num_gpu', 'run_args', 'num_cpu', 'run_cmd', 'job_log_path_abs', 'project_uid', 'worker_bin_path', 'ram_gb'], 'type': 'cluster', 'worker_bin_path': '/home/c/cryospar/cryosparc_worker/download/cryosparc_worker/bin/cryosparcw'}}
HOST ALLOCATION FUNCTION: using cudrv.pagelocked_empty 
========= sending heartbeat at 2023-04-29 00:38:56.114365
[... similar heartbeat lines removed here to keep the log under the forum's character limit ...]
========= sending heartbeat at 2023-04-29 08:43:29.814296
exception in cufft.Plan.__del__: 
exception in force_free_cufft_plan: 'NoneType' object has no attribute 'handle'
exception in force_free_cufft_plan: 
[... the three exception lines above repeat, interleaved, well over a hundred times; trimmed for readability ...]
---- Computing FSC with mask 2.00 to 6.00
exception in force_free_cufft_plan: 
exception in force_free_cufft_plan: 
exception in force_free_cufft_plan: 
exception in cufft.Plan.__del__: 
exception in cufft.Plan.__del__: 
exception in cufft.Plan.__del__: 
exception in cufft.Plan.__del__: 
**** handle exception rc
set status to failed
/home/c/cryospar/cryosparc_worker/download/cryosparc_worker/cryosparc_compute/plotutil.py:553: RuntimeWarning: divide by zero encountered in log
  logabs = n.log(n.abs(fM))
/home/c/cryospar/cryosparc_worker/download/cryosparc_worker/cryosparc_compute/sigproc.py:653: FutureWarning: `rcond` parameter will change to the default of machine precision times ``max(M, N)`` where M and N are the input matrix dimensions.
To use the future default and silence this warning we advise to pass `rcond=None`, to keep using the old, explicitly pass `rcond=-1`.
  x = n.linalg.lstsq(w.reshape((-1,1))*A, w*b)[0]
/home/c/cryospar/cryosparc_worker/download/cryosparc_worker/cryosparc_compute/plotutil.py:29: RuntimeWarning: invalid value encountered in sqrt
  cradwn = n.sqrt(cradwn)
/home/c/cryospar/cryosparc_worker/download/cryosparc_worker/cryosparc_compute/sigproc.py:953: RuntimeWarning: invalid value encountered in true_divide
  fsc_true = (fsc_t - fsc_n) / (1.0 - fsc_n)
[... these four warnings (plotutil.py:553, sigproc.py:653, plotutil.py:29, sigproc.py:953) repeat several more times; trimmed for readability ...]
Process Process-1:
Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 106, in cryosparc_compute.run.main
  File "/home/c/cryospar/cryosparc_worker/download/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 112, in func
    assert "error" not in res, f'Error for "{key}" with params {params}:\n' + format_server_error(res["error"])
AssertionError: Error for "dump_job_database" with params {'project_uid': 'P38', 'job_uid': 'J260', 'job_completed': True}:
ServerError: [Errno 116] Stale file handle: '/scratch/tmp/cryospar/Projects'
Traceback (most recent call last):
  File "/mnt/cryosparc/cryosparc_master/installation/cryosparc_master/cryosparc_command/commandcommon.py", line 200, in wrapper
    res = func(*args, **kwargs)
  File "/mnt/cryosparc/cryosparc_master/installation/cryosparc_master/cryosparc_command/commandcommon.py", line 266, in wrapper
    return func(*args, **kwargs)
  File "/mnt/cryosparc/cryosparc_master/installation/cryosparc_master/cryosparc_command/command_core/__init__.py", line 3613, in dump_job_database
    rc.dump_job_database(project_uid = project_uid, job_uid = job_uid, job_completed = job_completed, migration = migration, abs_export_dir = abs_export_dir, logger = logger)
  File "/mnt/cryosparc/cryosparc_master/installation/cryosparc_master/cryosparc_compute/jobs/runcommon.py", line 413, in dump_job_database
    mkdir_p(expanded_export_dir_abs)
  File "/mnt/cryosparc/cryosparc_master/installation/cryosparc_master/cryosparc_compute/jobs/runcommon.py", line 681, in mkdir_p
    os.makedirs(path, exist_ok=True)
  File "/mnt/cryosparc/cryosparc_master/installation/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/mnt/cryosparc/cryosparc_master/installation/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/mnt/cryosparc/cryosparc_master/installation/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/mnt/cryosparc/cryosparc_master/installation/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
OSError: [Errno 116] Stale file handle: '/scratch/tmp/cryospar/Projects'


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/c/cryospar/cryosparc_worker/download/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/c/cryospar/cryosparc_worker/download/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "cryosparc_master/cryosparc_compute/run.py", line 120, in cryosparc_compute.run.main
  File "/home/c/cryospar/cryosparc_worker/download/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2029, in handle_exception
    set_job_status('failed')
  File "/home/c/cryospar/cryosparc_worker/download/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 241, in set_job_status
    cli.set_job_status(_project_uid, _job_uid, status)
  File "/home/c/cryospar/cryosparc_worker/download/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 112, in func
    assert "error" not in res, f'Error for "{key}" with params {params}:\n' + format_server_error(res["error"])
AssertionError: Error for "set_job_status" with params ('P38', 'J260', 'failed'):
ServerError: validation error: lock file for P38 not found at /scratch/tmp/cryospar/Projects/CS-230206-protein-protein-buk-ks/cs.lock
Traceback (most recent call last):
  File "/mnt/cryosparc/cryosparc_master/installation/cryosparc_master/cryosparc_command/commandcommon.py", line 200, in wrapper
    res = func(*args, **kwargs)
  File "/mnt/cryosparc/cryosparc_master/installation/cryosparc_master/cryosparc_command/commandcommon.py", line 251, in wrapper
    assert os.path.isfile(
AssertionError: validation error: lock file for P38 not found at /scratch/tmp/cryospar/Projects/CS-230206-protein-protein-buk-ks/cs.lock

========= main process now complete at 2023-04-29 08:43:31.602845.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "cryosparc_master/cryosparc_compute/run.py", line 217, in cryosparc_compute.run.run
  File "/home/c/cryospar/cryosparc_worker/download/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 112, in func
    assert "error" not in res, f'Error for "{key}" with params {params}:\n' + format_server_error(res["error"])
AssertionError: Error for "set_job_status" with params ('P38', 'J260', 'failed'):
ServerError: validation error: lock file for P38 not found at /scratch/tmp/cryospar/Projects/CS-230206-protein-protein-buk-ks/cs.lock
Traceback (most recent call last):
  File "/mnt/cryosparc/cryosparc_master/installation/cryosparc_master/cryosparc_command/commandcommon.py", line 200, in wrapper
    res = func(*args, **kwargs)
  File "/mnt/cryosparc/cryosparc_master/installation/cryosparc_master/cryosparc_command/commandcommon.py", line 251, in wrapper
    assert os.path.isfile(
AssertionError: validation error: lock file for P38 not found at /scratch/tmp/cryospar/Projects/CS-230206-protein-protein-buk-ks/cs.lock

As it’s a cluster, does it have a maximum run time defined for jobs?

Yes, it's 7 days. These jobs hit this error anywhere from 8 hours in, to sometimes after an entire day (24 hours).

Right, it’s not that, then. Sorry, wanted to check because I’ve seen short timeouts before now.

Yeah! That was the very same question that came to my mind as well. I checked the submission scripts too; it's set to 7 days.

Please can you inspect the job log (accessible via Metadata|Log) for additional information, such as errors and heartbeat timestamps. Does the log show any errors, or a gap in the heartbeat timestamps?
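If the browser view is unwieldy for a long log, the same information can be read from the master node's shell (a sketch; cryosparcm joblog is the CLI counterpart of Metadata|Log, and the job.log path below assumes the default job directory layout):

# Follow the job log for P38/J260 from the CryoSPARC master node:
cryosparcm joblog P38 J260
# Or search the on-disk log for heartbeat lines to spot a gap
# (the path is the job directory inside your project directory):
grep "sending heartbeat" /path/to/project_dir/J260/job.log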

Total shot in the dark, but I've found that CryoSPARC chronically under-requests memory for cluster jobs, and the error logs don't tend to report it well. It typically manifests in refinements, classifications, and sometimes motion correction. I multiplied the reserved memory for each job by 1.5-2x, and that resolved the issue. Something along the lines of the sketch below.
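A sketch of what that change can look like in a SLURM template like the one in the log above ({{ ram_gb }} is CryoSPARC's own memory estimate in GB; the factor of 2 and the re-connect step are my assumptions, adjust to taste):

# In cluster_script.sh, request a multiple of CryoSPARC's RAM estimate
# instead of a fixed per-GPU amount:
#SBATCH --mem={{ (ram_gb*2)|int }}G

# Re-register the lane from the directory containing cluster_info.json
# and cluster_script.sh so the edited template takes effect:
cryosparcm cluster connect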


I have now pasted the log file into the original post. Please let me know if I should provide any other details. Thank you very much!

There seems to be a problem with access to your project directory. Maybe the IT specialist can help you fix this problem, after which you could re-try the job.

Thank you! Is there anything specific I should mention to the IT specialist?

Maybe this excerpt from the log:

ServerError: [Errno 116] Stale file handle: '/scratch/tmp/cryospar/Projects'

with an additional explanation that

/scratch/tmp/cryospar/Projects/CS-230206-protein-protein-buk-ks/

is (supposed to be) the project directory for your processing.
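A quick check the IT specialist can run from the CryoSPARC master node (illustrative commands; any simple access attempt on a stale NFS mount should fail with Errno 116):

# If the handle is stale, even a plain stat fails with "Stale file handle":
stat /scratch/tmp/cryospar/Projects
# See whether and how the path is currently mounted:
mount | grep /scratch
df -h /scratch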

Sorry for the slow reply. Stale file handles are usually (at least in my experience) from NFS shares which have been lost temporarily. Either the remote data system was rebooted, or the network flaked briefly and recovered.
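For what it's worth, a sketch of the sort of NFS client options that help ride out brief server or network lapses (the server name and values here are illustrative only, not a recommendation for this specific cluster):

# /etc/fstab (illustrative): "hard" makes the client retry I/O indefinitely
# instead of erroring out; timeo (tenths of a second) and retrans tune the
# retry cadence before the client reports a problem.
fileserver:/export/scratch  /scratch  nfs  hard,timeo=600,retrans=5,_netdev  0  0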

So I asked our IT specialist about this. As @wtempel pointed out, the culprit is the Errno 116 stale file handle, and as @rbs_sci mentioned, rebooting or remounting helps. The IT specialist therefore configured things so that whenever CryoSPARC loses contact with the directory, a remount is triggered automatically, restoring the connection. He also increased the heartbeat timeout from 60 to 600 seconds, which gives enough time to remount and restore the connection. The problem is solved now.
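In case it helps someone, a sketch of the two changes (I believe CRYOSPARC_HEARTBEAT_SECONDS is the relevant master setting, and autofs is one common way to get automatic remounts; please verify both against the CryoSPARC guide and your own cluster setup before copying):

# cryosparc_master/config.sh: raise the heartbeat timeout from 60 s to 600 s
export CRYOSPARC_HEARTBEAT_SECONDS=600
cryosparcm restart    # restart the master so the setting takes effect

# OS side (illustrative autofs config): mount the share on first access and
# remount automatically after it times out or goes away.
# /etc/auto.master:   /scratch   /etc/auto.scratch   --timeout=300
# /etc/auto.scratch:  tmp  -rw,hard   fileserver:/export/scratch/tmp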
