Hello.
Our researchers are having intermittent issues with 2D classification. Jobs are running in the pipeline, so there should not be any missing files, but we get the attached error. If we clear the job and re-run it, or lower the number of classes, it will then work.
What could be the root cause of this issue?
Hi @yodamoppet, thanks for posting. We are looking into this. Can you please copy-paste the text of the traceback for future searchability?
Excellent.
Certainly, here is the traceback text:
[CPU: 4.90 GB] Traceback (most recent call last):
  File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/jobs/runcommon.py", line 1711, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 129, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 136, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 997, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 87, in cryosparc_compute.engine.engine.EngineThread.load_image_dat
  File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/particles.py", line 168, in get_original_real_data
    return self.blob.view().copy()
  File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/blobio/mrc.py", line 124, in view
    return self.get()
  File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/blobio/mrc.py", line 119, in get
    with open(self.fname) as file_obj:
FileNotFoundError: [Errno 2] No such file or directory: '/local/instance_vision.structbio.pitt.edu:39061/projects/P29/J73/extract/10178085504764670687_FoilHole_19857728_Data_19856882_19856884_20210219_234846_Fractions_patch_aligned_doseweighted_particles.mrc'
Hi @yodamoppet,
Sorry for the delay. This traceback suggests the actual file was deleted from the cache folder. Is it possible someone on this same machine deleted all the files on that volume to make space for their work? Or is it possible there is a script or task that runs and clears out the cache volume intermittently?
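If it helps, a quick way to look for that kind of automated cleanup on the worker node is something like the following; note that /local below is just a placeholder for whatever your cache volume actually is:
# list per-user and system-wide cron jobs that might touch the cache volume
crontab -l
ls /etc/cron.d /etc/cron.hourly /etc/cron.daily
# list systemd timers (tmpfiles/tmpwatch-style cleanup often runs from one of these)
systemctl list-timers --all
# check how full the cache volume currently is (placeholder path)
df -h /local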
Hi Stephen,
Sorry for the delay in getting back to this.
Yes, after tracing, this appears to be precisely what has happened. The cache drive filled up and its contents were deleted by a script monitoring for this situation. I have disabled it for now.
I notice that there is an option "--ssdquota" which will allow me to limit the SSD space the worker uses. Is this something that can be added to config.sh for the worker (i.e. "export SSDQUOTA=500GB" or similar), or do I need to recompile the worker to get this functionality?
Thanks!
Hi @yodamoppet,
Yes, you can easily update a worker's configuration by running the ./bin/cryosparcw connect command with the --update flag:
https://guide.cryosparc.com/setup-configuration-and-management/how-to-download-install-and-configure/downloading-and-installing-cryosparc#connect-the-standalone-worker-node-to-the-master-node
./bin/cryosparcw connect --worker <worker_hostname> \
--master <master_hostname> \
--port <port_num> \
--update \
--ssdquota <ssd_quota_mb>
So, in this case, using --update will set the ssdquota without messing with other parameters. For example:
./cryosparcw connect --update --ssdquota 1000000
to give the worker a quota of 1 TB without adjusting any other previously configured parameters?
Thanks for the assistance!
You will need to provide the worker, master, and port arguments as well (to indicate which worker you'd like to update).
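For example, a full update command might look like this (the hostnames and port below are placeholders, not your actual values):
./bin/cryosparcw connect --worker worker1.example.edu \
                         --master master.example.edu \
                         --port 39000 \
                         --update \
                         --ssdquota 1000000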
Very good, thank you.
Is there any way to output/verify what was used for worker, master, port?
Can this be done while jobs are running, or will it interrupt jobs?
Hi @yodamoppet,
You can look at the "Instance Information" tab in the Resource Manager to see the argument for worker (Hostname):
The master and port arguments will be found in cryosparc_master/config.sh as CRYOSPARC_MASTER_HOSTNAME and CRYOSPARC_BASE_PORT respectively.
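For example, you could pull both values out of the master's config.sh with something like this (adjust the install path to match your setup):
grep -E 'CRYOSPARC_MASTER_HOSTNAME|CRYOSPARC_BASE_PORT' /path/to/cryosparc_master/config.sh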
Yes, it won't interrupt any running jobs.
Thanks for the details. I've mostly got this ready to try, but I'm not sure what to use for the --worker argument. This is a cluster system, by the way. I don't see workers listed as in your "Instance Information" tab (see image).
What in this case is the appropriate argument to use for --worker in the update command?
Hey @yodamoppet,
In the case of a cluster, the way to update the configuration is different.
https://guide.cryosparc.com/setup-configuration-and-management/how-to-download-install-and-configure/downloading-and-installing-cryosparc#update-a-cluster-configuration
To update an existing cluster integration, call the cryosparcm cluster connect command with the updated cluster_info.json and cluster_script.sh in the current working directory. Note that the name field in cluster_info.json must be the same as in the cluster configuration you're trying to update.
If you don't already have the cluster_info.json and cluster_script.sh files in your current working directory, you can get them by running the command cryosparcm cluster dump <name>.
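As a rough sketch of that workflow, assuming your cluster lane is named vision and using an arbitrary scratch directory:
mkdir -p ~/cluster_update && cd ~/cluster_update   # any working directory will do
cryosparcm cluster dump vision                     # writes cluster_info.json and cluster_script.sh here
# edit cluster_info.json (and cluster_script.sh if needed)
cryosparcm cluster connect                         # re-registers the "vision" lane with the updated files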
Hello.
Thanks for the clarification. I have the cluster_info.json and cluster_script.sh files in the cryosparcm directory.
What arguments do I need to add to these files to limit the ssd quota? It looks like this would mainly go in cluster_info.json, right?
Current cluster_info.json is:
{
  "name" : "vision",
  "worker_bin_path" : "/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw",
  "cache_path" : "/local",
  "send_cmd_tpl" : "{{ command }}",
  "qsub_cmd_tpl" : "sbatch {{ script_path_abs }}",
  "qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
  "qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
  "qinfo_cmd_tpl" : "sinfo",
  "transfer_cmd_tpl" : "scp {{ src_path }} loginnode:{{ dest_path }}"
}
Hi @yodamoppet,
Correct, in the cluster_info.json file you can add a new field, cache_quota_mb, with the value in MB.
For example, if you wanted a 1TB quota, your file will look like:
{
  "name" : "vision",
  "worker_bin_path" : "/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw",
  "cache_path" : "/local",
  "send_cmd_tpl" : "{{ command }}",
  "qsub_cmd_tpl" : "sbatch {{ script_path_abs }}",
  "qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
  "qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
  "qinfo_cmd_tpl" : "sinfo",
  "transfer_cmd_tpl" : "scp {{ src_path }} loginnode:{{ dest_path }}",
  "cache_quota_mb" : 1000000
}
Very good. Two additional questions:
As far as syntax goes, should there be a closing quote after the 1000000?
And will this interrupt jobs, and do I need to stop or restart the cryosparcm process at any stage?
Hi @yodamoppet,
The cryoSPARC function that reads this JSON file expects a number for this field, so quotes aren't required.
You can view the field descriptions, including their datatypes, here:
https://guide.cryosparc.com/setup-configuration-and-management/how-to-download-install-and-configure/downloading-and-installing-cryosparc#create-the-files
This will not interrupt any currently running jobs. You can safely run this command while cryoSPARC is running and while jobs are running, no need to restart.
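If you want to double-check the syntax before re-connecting, one option (assuming Python is available on the master node) is to run the file through Python's built-in JSON parser, which will flag a stray quote or missing comma:
python -m json.tool cluster_info.json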
Very good, I have just executed this.
I get the following message after executing it:
Successfully added cluster lane vision
Is that the correct completion message? I expected something like "Successfully updated cluster lane vision", but perhaps the message is the same whether adding a lane or updating it?
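In the meantime, would querying the scheduler targets be a reasonable way to confirm the new quota took effect? For example, something like:
cryosparcm cli "get_scheduler_targets()"
which, if I understand the docs correctly, should list the vision lane with its cache_quota_mb value.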