Hi all,
When running a Local Resolution estimation job with all parameters at their defaults, I get the traceback below. The job does run if CPU is specified instead, albeit slowly.
[CPU: 1006.8 MB] Using local box size of 96 voxels.
[CPU: 1006.8 MB] Using zeropadded box size of 192 voxels.
[CPU: 1006.8 MB] Using step size of 1 voxels.
[CPU: 1006.8 MB] Using FSC threshold of 0.500.
[CPU: 1.05 GB] Started computing local resolution estimates.
[CPU: 1.05 GB] Number of voxels to compute: 3628483
[CPU: 1.56 GB] Traceback (most recent call last):
  File "/soft/cryosparc_hodgkin/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1790, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/jobs/local_resolution/run.py", line 788, in cryosparc_compute.jobs.local_resolution.run.standalone_locres.work
  File "/soft/cryosparc_hodgkin/cryosparc_worker/cryosparc_compute/skcuda_internal/fft.py", line 134, in __init__
    onembed, ostride, odist, self.fft_type, self.batch)
  File "/soft/cryosparc_hodgkin/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 749, in cufftMakePlanMany
    cufftCheckStatus(status)
  File "/soft/cryosparc_hodgkin/cryosparc_worker/cryosparc_compute/skcuda_internal/cufft.py", line 124, in cufftCheckStatus
    raise e
cryosparc_compute.skcuda_internal.cufft.cufftInternalError
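The traceback points at cufftMakePlanMany, i.e. cuFFT fails while building the plan for the 192-voxel zero-padded box, before any data is transformed. As a sanity check, the same 3D R2C plan can be built outside cryoSPARC; this is only a minimal sketch, assuming the stand-alone pycuda and scikit-cuda packages are available in a test environment (cryoSPARC ships its own vendored cryosparc_compute.skcuda_internal modules, so this only approximates what the job does):

import numpy as np
import pycuda.autoinit              # creates a CUDA context on the default GPU
import pycuda.gpuarray as gpuarray
import skcuda.fft as cu_fft

# Same zero-padded box size reported by the job (192 voxels per side).
shape = (192, 192, 192)

x_gpu = gpuarray.to_gpu(np.random.rand(*shape).astype(np.float32))
# R2C output: the last dimension becomes N/2 + 1 = 97 complex samples.
y_gpu = gpuarray.empty((192, 192, 97), np.complex64)

# Plan creation is the step that raises cufftInternalError in the job.
plan = cu_fft.Plan(shape, np.float32, np.complex64)
cu_fft.fft(x_gpu, y_gpu, plan)
print("192^3 R2C cuFFT plan and transform completed without error")

If this fails with the same cufftInternalError, the problem is in the CUDA/cuFFT installation rather than in cryoSPARC itself.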
Current cryoSPARC version: v3.2.0
The job fails:
- whether submitted through SLURM or through cryoSPARC's default queue
- on nodes with 32, 128, or 384 GB of RAM
- on GTX 1080, RTX 2070 SUPER, and RTX 2080 Ti cards
- with NVIDIA driver 460 + CUDA 11.2 (Ubuntu 18)
- with NVIDIA driver 465 + CUDA 11.3 (Ubuntu 18, Ubuntu 20)
and various combinations thereof.
All other cryoSPARC job types seem to run fine on all machines and on the SLURM nodes.
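To help narrow down the driver/toolkit combinations listed above, here is a quick check of what the Python GPU stack actually reports on a worker node; again a sketch assuming the stand-alone pycuda package in a test environment rather than cryoSPARC's vendored copy:

import pycuda.driver as cuda

cuda.init()
print("CUDA version pycuda was built against:", cuda.get_version())            # e.g. (11, 2, 0)
print("Highest CUDA API supported by the driver:", cuda.get_driver_version())  # e.g. 11030
print("GPU 0:", cuda.Device(0).name())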
Further logs:
$>cryosparcm joblog P85 J105
Traceback (most recent call last):
  File "/soft/cryosparc_hodgkin/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/soft/cryosparc_hodgkin/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/soft/cryosparc_hodgkin/cryosparc_master/cryosparc_compute/client.py", line 89, in <module>
    print(eval("cli."+command))
  File "<string>", line 1, in <module>
  File "/soft/cryosparc_hodgkin/cryosparc_master/cryosparc_compute/client.py", line 62, in func
    assert False, res['error']
AssertionError: {'code': 500, 'data': None, 'message': "OtherError: argument of type 'NoneType' is not iterable", 'name': 'OtherError'}
$>cryosparcm log command_core
[trimmed; in this case the job is sent to a GTX 1080 via cryoSPARC's Select GPU option]
[More logs are available for the other configurations, but the full output is quite hefty to attach]
[EXPORT_JOB] : Request to export P83 J105
[EXPORT_JOB] : Exporting job to /mnt/DATA/andrea/AM_20210709_XXX/P83/J105
[EXPORT_JOB] : Exporting all of job's images in the database to /mnt/DATA/andrea/AM_20210709_XXX/P83/J105/gridfs_data...
[EXPORT_JOB] : Done. Exported 0 images in 0.00s
[EXPORT_JOB] : Exporting all job's streamlog events...
[EXPORT_JOB] : Done. Exported 1 files in 0.00s
[EXPORT_JOB] : Exporting job metafile...
[EXPORT_JOB] : Done. Exported in 0.00s
[EXPORT_JOB] : Updating job manifest...
[EXPORT_JOB] : Done. Updated in 0.00s
[EXPORT_JOB] : Exported P83 J105 in 0.01s
---------- Scheduler running ---------------
Jobs Queued: [('P83', 'J105')]
Licenses currently active : 0
Now trying to schedule J105
Need slots : {'CPU': 2, 'GPU': 1, 'RAM': 1}
Need fixed : {'SSD': False}
Master direct : False
Running job directly on GPU id(s): [0] on hopper.rhpc.nki.nl
Not a commercial instance - heartbeat set to 12 hours.
Launchable! -- Launching.
Changed job P83.J105 status launched
Running project UID P83 job UID J105
Running job on worker type node
Running job using: /soft/cryosparc_hodgkin/cryosparc_worker/bin/cryosparcw
Running job on remote worker node hostname hopper.rhpc.nki.nl
cmd: bash -c "nohup /soft/cryosparc_hodgkin/cryosparc_worker/bin/cryosparcw run --project P83 --job J105 --master_hostname hodgkin.rhpc.nki.nl --master_command_core_port 39002 > /mnt/DATA/andrea/AM_20210709_XXX/P83/J105/job.log 2>&1 & "
---------- Scheduler finished ---------------
Changed job P83.J105 status started
Changed job P83.J105 status running
Changed job P83.J105 status failed
The database logs look normal (they are also quite hefty to attach).
I can provide more info if necessary.
Thanks for your help; this one is a head-scratcher!
Best,
Andrea