Junk Detector not working

Hi,

Junk Detector has stopped working, giving the attached error, even though other GPU-requiring jobs run without issue. Thoughts?

Error:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 129, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/micrograph_analysis/run_junk_detector.py", line 329, in cryosparc_master.cryosparc_compute.jobs.micrograph_analysis.run_junk_detector.run_junk_detector_v1
  File "cryosparc_master/cryosparc_compute/jobs/micrograph_analysis/run_junk_detector.py", line 174, in cryosparc_master.cryosparc_compute.jobs.micrograph_analysis.run_junk_detector.infer
  File "cryosparc_master/cryosparc_compute/jobs/micrograph_analysis/run_junk_detector.py", line 176, in cryosparc_master.cryosparc_compute.jobs.micrograph_analysis.run_junk_detector.infer
  File "/home/user/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "cryosparc_master/cryosparc_compute/jobs/micrograph_analysis/junktransformer_v1.py", line 147, in cryosparc_master.cryosparc_compute.jobs.micrograph_analysis.junktransformer_v1.Transformer.forward
  File "/home/user/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/torch/nn/modules/transformer.py", line 678, in forward
    return torch._transformer_encoder_layer_fwd(
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasLtMatmulAlgoGetHeuristic( ltHandle, computeDesc.descriptor(), Adesc.descriptor(), Bdesc.descriptor(), Cdesc.descriptor(), Cdesc.descriptor(), preference.descriptor(), 1, &heuristicResult, &returnedResult)`

Cheers
Oli

May I ask:

  1. Has the Junk Detector worked previously with any data set on the same node and GPU where CUBLAS_STATUS_NOT_SUPPORTED was observed for this data set?
  2. Has the Junk Detector worked previously with this data set on any node?
  3. Has the nvidia driver recently been upgraded or downgraded on the node where CUBLAS_STATUS_NOT_SUPPORTED was observed?

Has the Junk Detector worked previously with any data set on the same node and GPU where CUBLAS_STATUS_NOT_SUPPORTED was observed for this data set?

Yes - and those previously working jobs, when cloned, now give this error.

Has the nvidia driver recently been upgraded or downgraded on the node where CUBLAS_STATUS_NOT_SUPPORTED was observed?

I don’t believe so… the current driver version from nvidia-smi is 535.183.01, but I’m not exactly sure how to check the history of driver changes. It was working fine last week…

Getting the exact same error on one system with nvidia driver 535.154.05 - this is a 2x 3090 system with CUDA 12.2. However, I have never been able to get Micrograph Junk Detector to work on this system.

nvidia-installer log file ‘/var/log/nvidia-installer.log’
creation time: Fri Jan 19 14:10:15 2024
installer version: 535.154.05
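
For anyone else trying to reconstruct a driver change history on their node, a rough sketch of where to look (log locations vary by distro; the paths below are assumptions):

# Runfile installer log (present if the driver was installed via the .run installer)
ls -l /var/log/nvidia-installer.log* 2>/dev/null
# Package manager history (Debian/Ubuntu and RHEL-family respectively)
grep -i nvidia /var/log/dpkg.log* 2>/dev/null
grep -i nvidia /var/log/dnf.log* /var/log/yum.log* 2>/dev/null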

@olibclarke @T_Bird Thanks for reporting this problem, which we have not seen before. If some of your Junk Detector jobs succeeded, you may want to compare the nvidia driver versions between failed and completed jobs:

csprojectid="P99"; csjobid="J199" # replace with actual project and job ids of relevant job
cryosparcm cli "get_job('$csprojectid', '$csjobid', 'type', 'version', 'errors_run', 'params_spec', 'status', 'instance_information.driver_version')"

Hi @wtempel,

For the job that succeeded:

{'_id': 'redacted', 'errors_run': [], 'instance_information': {'driver_version': '12.2'}, 'params_spec': {}, 'project_uid': 'P60', 'status': 'completed', 'type': 'junk_detector_v1', 'uid': 'J517', 'version': 'v4.7.0'}

For the one that failed:

{'_id': 'redacted', 'errors_run': [{'message': 'CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasLtMatmulAlgoGetHeuristic( ltHandle, computeDesc.descriptor(), Adesc.descriptor(), Bdesc.descriptor(), Cdesc.descriptor(), Cdesc.descriptor(), preference.descriptor(), 1, &heuristicResult, &returnedResult)`', 'warning': False}], 'instance_information': {'driver_version': '12.2'}, 'params_spec': {}, 'project_uid': 'P60', 'status': 'failed', 'type': 'junk_detector_v1', 'uid': 'J653', 'version': 'v4.7.0'}

However, it doesn’t seem to actually print the nvidia driver version per se, only the system CUDA version, which is the same between the two jobs?

(this also seems to print out the license ID, which I have redacted from this output)

Thanks @olibclarke for posting these outputs, which are useful for our troubleshooting.

This should be the CUDA version associated with the nvidia driver, not necessarily the version of any CUDA toolkit installed on the system. The driver version can be inferred from it (background).
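
As a cross-check, the driver version can also be read directly on the worker node, assuming nvidia-smi is on the PATH there:

# Print the installed NVIDIA driver version per GPU (nvidia-smi ships with the driver)
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# The banner of plain `nvidia-smi` also shows the driver version and the highest
# CUDA version that driver supports
nvidia-smi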

I’d expect the '_id' printed by get_job() to be distinct from the license ID. Please can you confirm?


Hello guys,

I was finally able to update to 4.7 today, and the same thing as @T_Bird happened: right on the first try, the Junk Detector crashed after 4 seconds with the same cuBLAS error. We are on a standalone server.

My Nvidia-installer log:
creation time: Fri Apr 19 15:55:42 2024
installer version: 525.125.06

If I understood correctly, this is not related to huge pages, right (mine are enabled)? Someone in the lab is suggesting that I update CUDA. Would that help?

Saw another post and tried adding:

unset LD_LIBRARY_PATH

to cryosparc/cryosparc_worker/config.sh, which got the job running on this system that had never been able to run Micrograph Junk Detector before.
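
For anyone else trying this, a sketch of what the edit and a quick sanity check look like (the install path below is only an example, taken from the traceback above; adjust for your setup):

# Appended at the end of cryosparc_worker/config.sh, e.g.
#   /home/user/software/cryosparc/cryosparc2_worker/config.sh
# This stops a system-wide CUDA toolkit on LD_LIBRARY_PATH from shadowing the
# libraries bundled with the CryoSPARC worker environment.
unset LD_LIBRARY_PATH

# Sanity check, run as the cryosparc user on the worker node:
source /home/user/software/cryosparc/cryosparc2_worker/config.sh
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"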

Hi @olibclarke @T_Bird @carlos,

Thanks very much for reporting this and for weighing in. I have never seen this issue myself, but I can give you some pointers for debugging.

First of all, I can confirm that it is extremely unlikely this is in any way related to transparent huge pages. The most likely culprit is a version mismatch between CUDA libraries. LD_LIBRARY_PATH, CUDA-related environment variables (env | grep CU, for example), and driver version changes would be the first things to check. If it’s really the case that the same job that previously worked now fails when re-run, then there must have been some kind of environment change.

If the junk detector never worked correctly, then yes, I would check the cryosparc account’s .bashrc file and the cryosparc_worker/config.sh for any environment variables related to CUDA. CryoSPARC requires that the user install the nvidia driver (CryoSPARC Installation Prerequisites | CryoSPARC Guide) but modern cryoSPARC versions do not require that the user install the CUDA toolkit manually - the cryoSPARC installer takes care of this. A manual installation of the CUDA toolkit, particularly coupled with setting environment variables that override which version of various CUDA components are used, could cause this problem. But a manual installation of the CUDA toolkit (as opposed to the driver) will not fix this.
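
A quick way to audit both files and the live environment for CUDA-related overrides (a sketch; the worker path is an example and should be adjusted for your install, run as the cryosparc user on the worker node):

# Look for CUDA toolkit paths or library overrides in the shell startup file
# and in the worker config, then inspect the environment the worker inherits
grep -niE 'cuda|ld_library_path' ~/.bashrc
grep -niE 'cuda|ld_library_path' /home/user/software/cryosparc/cryosparc2_worker/config.sh
env | grep -iE 'cuda|ld_library_path'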

Another possibility, in cases where the failure is dataset dependent, is that specific magnification levels that we did not encounter in testing produce micrograph dimensions outside the range cuBLAS supports. This is only possible for data sets where the micrograph physical extent (pixel size * pixel count) in either the X or Y axis is less than about 3200 Å. If you think you’re in this situation, please let me know.
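
To check whether a data set falls into this regime, a minimal sketch (the values below are examples only; substitute your own pixel size in Å/pixel and micrograph dimensions in pixels):

# Physical extent = pixel size (A/pixel) * pixel count, per axis
apix=0.83; nx=5760; ny=4092   # example values only
awk -v a="$apix" -v x="$nx" -v y="$ny" \
  'BEGIN { printf "X extent: %.0f A\nY extent: %.0f A\n", a*x, a*y }'
# If either extent is below ~3200 A, the data set may be in the affected regime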

–Harris
