Hello.
Our researchers are having intermittent issues with 2D classification. Jobs are running in the pipeline, so there should not be any missing files, but we get the attached error. If we clear the job and re-run it, or lower the number of classes, it will then work.
What could be the root cause of this issue?
Hi @yodamoppet, thanks for posting. We are looking into this. Can you please copy-paste the text of the traceback for future searchability?
Excellent.
Certainly, here is the traceback text:
[CPU: 4.90 GB] Traceback (most recent call last):
  File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/jobs/runcommon.py", line 1711, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 129, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 136, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 997, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 87, in cryosparc_compute.engine.engine.EngineThread.load_image_dat
  File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/particles.py", line 168, in get_original_real_data
    return self.blob.view().copy()
  File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/blobio/mrc.py", line 124, in view
    return self.get()
  File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/blobio/mrc.py", line 119, in get
    with open(self.fname) as file_obj:
FileNotFoundError: [Errno 2] No such file or directory: '/local/instance_vision.structbio.pitt.edu:39061/projects/P29/J73/extract/10178085504764670687_FoilHole_19857728_Data_19856882_19856884_20210219_234846_Fractions_patch_aligned_doseweighted_particles.mrc'
Hi @yodamoppet,
Sorry for the delay. This traceback suggests the actual file was deleted from the cache folder. Is it possible someone on this same machine deleted all the files on that volume to make space for their work? Or is it possible there is a script or task that runs and clears out the cache volume intermittently?
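If it helps, a quick way to look for that kind of automated cleanup on the worker node is something like the following; note that /local below is just a placeholder for whatever your cache volume actually is:
# list per-user and system-wide cron jobs that might touch the cache volume
crontab -l
ls /etc/cron.d /etc/cron.hourly /etc/cron.daily
# list systemd timers (tmpfiles/tmpwatch-style cleanup often runs from one of these)
systemctl list-timers --all
# check how full the cache volume currently is (placeholder path)
df -h /local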
Hi Stephen,
Sorry for the delay in getting back to this.
Yes, after tracing, this appears to be precisely what has happened. The cache drive filled up and its contents were deleted by a script monitoring for this situation. I have disabled it for now.
I notice that there is an option "--ssdquota" which will allow me to limit the SSD space the worker uses. Is this something that can be added to config.sh for the worker (i.e. "export SSDQUOTA=500GB" or similar), or do I need to recompile the worker to get this functionality?
Thanks!
Hi @yodamoppet,
Yes, you can easily update a worker's configuration by running the ./bin/cryosparcw connect command with the --update flag:
https://guide.cryosparc.com/setup-configuration-and-management/how-to-download-install-and-configure/downloading-and-installing-cryosparc#connect-the-standalone-worker-node-to-the-master-node
./bin/cryosparcw connect --worker <worker_hostname> \
--master <master_hostname> \
--port <port_num> \
--update \
--ssdquota <ssd_quota_mb>
So, in this case, using --update will set the ssdquota without messing with other parameters. For example:
./cryosparcw connect --update --ssdquota 1000000
to give the worker a quota of 1 TB without adjusting any other previously configured parameters?
Thanks for the assistance!
You will need to provide the worker, master, and port arguments as well (to indicate which worker you'd like to update).
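For example, a full update command might look like this (the hostnames and port below are placeholders, not your actual values):
./bin/cryosparcw connect --worker worker1.example.edu \
                         --master master.example.edu \
                         --port 39000 \
                         --update \
                         --ssdquota 1000000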
Very good, thank you.
Is there any way to output/verify what was used for worker, master, port?
Can this be done while jobs are running, or will it interrupt jobs?
Hi @yodamoppet,
You can look at the "Instance Information" tab in the Resource Manager to see the argument for worker (Hostname):
The master and port arguments will be found in cryosparc_master/config.sh as CRYOSPARC_MASTER_HOSTNAME and CRYOSPARC_BASE_PORT respectively.
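For example, you could pull both values out of the master's config.sh with something like this (adjust the install path to match your setup):
grep -E 'CRYOSPARC_MASTER_HOSTNAME|CRYOSPARC_BASE_PORT' /path/to/cryosparc_master/config.sh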
Yes, it won't interrupt any running jobs.
Thanks for the details. I've mostly got this ready to try, but I'm not sure what to use for the --worker argument. This is a cluster system, by the way. I don't see workers listed as in your "Instance Information" tab (see image).
What in this case is the appropriate argument to use for --worker in the update command?
Hey @yodamoppet,
In the case of a cluster, the way to update the configuration is different.
https://guide.cryosparc.com/setup-configuration-and-management/how-to-download-install-and-configure/downloading-and-installing-cryosparc#update-a-cluster-configuration
To update an existing cluster integration, call the cryosparcm cluster connect command with the updated cluster_info.json and cluster_script.sh in the current working directory. Note that the name field in cluster_info.json must be the same as in the cluster configuration you're trying to update.
If you don't already have the cluster_info.json and cluster_script.sh files in your current working directory, you can get them by running the command cryosparcm cluster dump <name>.
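As a rough sketch of that workflow, assuming your cluster lane is named vision and using an arbitrary scratch directory:
mkdir -p ~/cluster_update && cd ~/cluster_update   # any working directory will do
cryosparcm cluster dump vision                     # writes cluster_info.json and cluster_script.sh here
# edit cluster_info.json (and cluster_script.sh if needed)
cryosparcm cluster connect                         # re-registers the "vision" lane with the updated files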
Hello.
Thanks for the clarification. I have the cluster_info.json and cluster_script.sh files in the cryosparcm directory.
What arguments do I need to add to these files to limit the ssd quota? It looks like this would mainly go in cluster_info.json, right?
Current cluster_info.json is:
{
  "name" : "vision",
  "worker_bin_path" : "/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw",
  "cache_path" : "/local",
  "send_cmd_tpl" : "{{ command }}",
  "qsub_cmd_tpl" : "sbatch {{ script_path_abs }}",
  "qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
  "qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
  "qinfo_cmd_tpl" : "sinfo",
  "transfer_cmd_tpl" : "scp {{ src_path }} loginnode:{{ dest_path }}"
}
Hi @yodamoppet,
Correct, in the cluster_info.json file you can add a new field, cache_quota_mb, with the value in MB.
For example, if you wanted a 1TB quota, your file will look like:
{
  "name" : "vision",
  "worker_bin_path" : "/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw",
  "cache_path" : "/local",
  "send_cmd_tpl" : "{{ command }}",
  "qsub_cmd_tpl" : "sbatch {{ script_path_abs }}",
  "qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
  "qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
  "qinfo_cmd_tpl" : "sinfo",
  "transfer_cmd_tpl" : "scp {{ src_path }} loginnode:{{ dest_path }}",
  "cache_quota_mb" : 1000000
}
Very good. Two additional questions:
As far as syntax goes, should there be a closing quote after the 1000000?
And will this interrupt jobs, and do I need to stop or restart the cryosparcm process at any stage?
Hi @yodamoppet,
The cryoSPARC function that reads this JSON file expects a number for this field, so quotes aren't required.
You can view the field descriptions, including their datatypes, here:
https://guide.cryosparc.com/setup-configuration-and-management/how-to-download-install-and-configure/downloading-and-installing-cryosparc#create-the-files
This will not interrupt any currently running jobs. You can safely run this command while cryoSPARC is running and while jobs are running, no need to restart.
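If you want to double-check the syntax before re-connecting, one option (assuming Python is available on the master node) is to run the file through Python's built-in JSON parser, which will flag a stray quote or missing comma:
python -m json.tool cluster_info.json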
Very good, I have just executed this.
I get the following message after executing it:
Successfully added cluster lane vision
Is that the correct completion message? I expected something like "Successfully updated cluster lane vision", but perhaps the message is the same whether adding a lane or updating it?
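In the meantime, would querying the scheduler targets be a reasonable way to confirm the new quota took effect? For example, something like:
cryosparcm cli "get_scheduler_targets()"
which, if I understand the docs correctly, should list the vision lane with its cache_quota_mb value.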