Junk Detector has stopped working, giving the attached error, even though other GPU-requiring jobs run without issue. Thoughts?
Error:
Traceback (most recent call last):
File "cryosparc_master/cryosparc_compute/run.py", line 129, in cryosparc_master.cryosparc_compute.run.main
File "cryosparc_master/cryosparc_compute/jobs/micrograph_analysis/run_junk_detector.py", line 329, in cryosparc_master.cryosparc_compute.jobs.micrograph_analysis.run_junk_detector.run_junk_detector_v1
File "cryosparc_master/cryosparc_compute/jobs/micrograph_analysis/run_junk_detector.py", line 174, in cryosparc_master.cryosparc_compute.jobs.micrograph_analysis.run_junk_detector.infer
File "cryosparc_master/cryosparc_compute/jobs/micrograph_analysis/run_junk_detector.py", line 176, in cryosparc_master.cryosparc_compute.jobs.micrograph_analysis.run_junk_detector.infer
File "/home/user/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "cryosparc_master/cryosparc_compute/jobs/micrograph_analysis/junktransformer_v1.py", line 147, in cryosparc_master.cryosparc_compute.jobs.micrograph_analysis.junktransformer_v1.Transformer.forward
File "/home/user/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/torch/nn/modules/transformer.py", line 678, in forward
return torch._transformer_encoder_layer_fwd(
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasLtMatmulAlgoGetHeuristic( ltHandle, computeDesc.descriptor(), Adesc.descriptor(), Bdesc.descriptor(), Cdesc.descriptor(), Cdesc.descriptor(), preference.descriptor(), 1, &heuristicResult, &returnedResult)`
Has the Junk Detector worked previously with any data set on the same node and GPU where CUBLAS_STATUS_NOT_SUPPORTED was observed for this data set?
Yes - and those previously working jobs, when cloned, now give this error.
Has the nvidia driver recently been upgraded or downgraded on the node where CUBLAS_STATUS_NOT_SUPPORTED is now observed?
I don’t believe so… but the current driver version from nvidia-smi is 535.183.01 - I’m not exactly sure how to check the driver history. It was working fine last week…
Getting the same exact error on one system with nvidia driver 535.154.05. This is a 2x 3090 system with CUDA 12.2. However, I have not been able to get the Micrograph Junk Detector to work on this system.
@olibclarke @T_Bird Thanks for reporting this problem, which we have not seen before. If some of your Junk Detector jobs succeeded, you may want to compare the nvidia driver versions between failed and completed jobs:
csprojectid="P99"; csjobid="J199" # replace with actual project and job ids of relevant job
cryosparcm cli "get_job('$csprojectid', '$csjobid', 'type', 'version', 'errors_run', 'params_spec', 'status', 'instance_information.driver_version')"
Thanks @olibclarke for posting these outputs, which are useful for our troubleshooting.
This should be the CUDA version associated with the nvidia driver, not necessarily the version of any CUDA toolkit installed on the system. The driver version can be inferred from it (background).
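If it helps, here is a rough sketch of how to see which CUDA version each component reports (the worker path is taken from the traceback above - adjust it for your install; nvcc will only be present if a CUDA toolkit was installed manually):
nvidia-smi | grep "CUDA Version"                     # CUDA version supported by the installed driver
nvidia-smi --query-gpu=driver_version --format=csv,noheader
which nvcc && nvcc --version                         # separate CUDA toolkit, if any is on the PATH
/home/user/software/cryosparc/cryosparc2_worker/bin/cryosparcw call \
  python -c "import torch; print('torch bundled CUDA:', torch.version.cuda)"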
I’d expect the '_id' printed by get_job() to be distinct from the license ID. Please can you confirm?
I was finally able to update to 4.7 today, and I see the same thing as @T_Bird: right on the first attempt, the junk detector crashes after 4 seconds with the same cuBLAS error. We are on a standalone server.
If I understood correctly, this is not related to huge pages, right (mine are enabled)? Someone in the lab is suggesting that I update CUDA. Would that help?
Thanks very much for reporting this and for weighing in. I have never seen this issue myself, but I can give you some pointers in terms of debugging.
First of all, I can confirm that it is extremely unlikely this is in any way related to transparent huge pages. The most likely culprit is a version mismatch between the various CUDA libraries. LD_LIBRARY_PATH, CUDA-related environment variables (env | grep CU, for example), and driver version changes would be the first things to check. If it’s really the case that the same job that previously worked now fails when re-run, then there must have been some kind of environment change.
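Concretely, something along these lines, run as the cryosparc user on the affected worker node (a minimal sketch):
env | grep -i cuda                                        # CUDA-related environment variables, if any
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -i cuda      # CUDA library paths being injected, if any
nvidia-smi --query-gpu=name,driver_version --format=csv   # current GPU model and driver on this node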
If the junk detector never worked correctly, then yes, I would check the cryosparc account’s .bashrc file and cryosparc_worker/config.sh for any environment variables related to CUDA. CryoSPARC requires that the user install the nvidia driver (CryoSPARC Installation Prerequisites | CryoSPARC Guide), but modern cryoSPARC versions do not require that the user install the CUDA toolkit manually - the cryoSPARC installer takes care of this. A manual installation of the CUDA toolkit, particularly when coupled with environment variables that override which versions of the various CUDA components are used, could cause this problem. But a manual installation of the CUDA toolkit (as opposed to the driver) will not fix it.
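For example (a rough sketch; the worker path is taken from the traceback earlier in this thread - adjust it for your install):
grep -inE "cuda|LD_LIBRARY_PATH" ~/.bashrc      # run as the cryosparc user
grep -inE "cuda|LD_LIBRARY_PATH" /home/user/software/cryosparc/cryosparc2_worker/config.sh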
Another possibility, in cases where the failure is dataset-dependent, is that specific magnification levels we did not encounter in testing produce micrograph dimensions that fall outside what cuBLAS supports. This is only possible for datasets where the micrograph physical extent (pixel size × pixel count) in either the X or Y axis is less than about 3200 Å. If you think you’re in this situation, please let me know.
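For a quick check, something like this (a minimal sketch with hypothetical numbers - substitute the pixel size and micrograph dimensions reported by your import job):
pixel_size=0.83     # Å per pixel (hypothetical - use your own value)
nx=4096; ny=4096    # micrograph dimensions in pixels (hypothetical)
awk -v ps="$pixel_size" -v nx="$nx" -v ny="$ny" \
  'BEGIN { printf "extent X = %.0f Å, extent Y = %.0f Å (issue only possible below ~3200 Å)\n", ps*nx, ps*ny }'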