Hello,
After updating our master/worker nodes to v3.1, I have been running into this issue. For testing, I am using the T20S dataset.
[CPU: 1.01 GB] Traceback (most recent call last):
File "cryosparc2_compute/jobs/runcommon.py", line 1685, in run_with_except_hook
run_old(*args, **kw)
File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 110, in cryosparc2_compute.engine.cuda_core.GPUThread.run
File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 111, in cryosparc2_compute.engine.cuda_core.GPUThread.run
File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 1011, in cryosparc2_compute.engine.engine.process.work
File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 175, in cryosparc2_compute.engine.engine.EngineThread.setup_current_data_and_ctf
File "cryosparc2_worker/cryosparc2_compute/engine/cuda_kernels.py", line 1732, in cryosparc2_compute.engine.cuda_kernels.extract_fourier_2D
File "/cryosparc/worker/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/pycuda/driver.py", line 382, in function_call
func._set_block_shape(*block)
LogicError: cuFuncSetBlockShape failed: invalid resource handle
All nodes are running CUDA 11.2 with NVIDIA driver 460.xx.
I was able to fix this by downgrading to CUDA 10.
I will report back if there are more issues.
I like how flexible it is to switch CUDA versions via cryosparcw newcuda <cudapath>.
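For anyone else hitting this, the switch back was just the following on each worker node (the CUDA path is from my setup; point it at wherever your CUDA 10.x toolkit lives):
cd /opt/cryosparc-v2/cryosparc2_worker
bin/cryosparcw newcuda /usr/local/cuda-10.2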
I ran into another problem and I believe it's related to the scratch/cache directory configuration.
All jobs fail with the error below when SSD caching is enabled.
I have tried removing workers via the icli interface and adding them again, but they all stop with the same error.
[CPU: 90.1 MB] Project P36 Job J67 Started
[CPU: 90.1 MB] Master running v3.1.0, worker running v3.1.0
[CPU: 90.3 MB] Running on lane shark
[CPU: 90.3 MB] Resources allocated:
[CPU: 90.3 MB] Worker: shark.qb3.berkeley.edu
[CPU: 90.3 MB] CPU : [0, 1]
[CPU: 90.4 MB] GPU : [0]
[CPU: 90.4 MB] RAM : [0, 1, 2]
[CPU: 90.4 MB] SSD : True
[CPU: 90.4 MB] --------------------------------------------------------------
[CPU: 90.4 MB] Importing job module for job type class_2D...
[CPU: 405.8 MB] Job ready to run
[CPU: 405.8 MB] ***************************************************************
[CPU: 406.1 MB] Using random seed of 1623841468
[CPU: 406.1 MB] Loading a ParticleStack with 11 items...
[CPU: 406.1 MB] SSD cache : cache successfuly synced in_use
[CPU: 406.1 MB] Traceback (most recent call last):
File "cryosparc2_worker/cryosparc2_compute/run.py", line 85, in cryosparc2_compute.run.main
File "cryosparc2_worker/cryosparc2_compute/jobs/class2D/run.py", line 64, in cryosparc2_compute.jobs.class2D.run.run_class_2D
File "cryosparc2_compute/particles.py", line 61, in read_blobs
u_blob_paths = cache.download_and_return_cache_paths(u_rel_paths)
File "cryosparc2_compute/jobs/cache.py", line 114, in download_and_return_cache_paths
used_mb = sync_hits(worker_hostname, ssd_cache_path, instance_id)
File "cryosparc2_compute/jobs/cache.py", line 191, in sync_hits
rc.cli.cache_sync_hits(worker_hostname, keys, sizes_mb)
File "cryosparc2_compute/client.py", line 57, in func
assert False, res['error']
AssertionError: {u'message': u"OtherError: argument should be a bytes-like object or ASCII string, not 'list'", u'code': 500, u'data': None, u'name': u'OtherError'}
Permissions on the scratch directory look good too:
drwxrwxr-x 3 cryosparc nogales-current 4096 Mar 5 18:00 cryosparc-scratch
Any ideas?
Hi @achintangal,
It seems like the cryoSPARC update/installation somehow failed.
Re-install cryoSPARC:
cryosparcm update --override
@Stephan,
I ran the update process again and I am still running into the same error. Here is the log from command_core:
Jobs Queued: [('P36', 'J72')]
Licenses currently active : 1
Now trying to schedule J72
Need slots : {'CPU': 2, 'GPU': 1, 'RAM': 3}
Need fixed : {'SSD': True}
Master direct : False
Scheduling job to albakor.x.x.
Failed to connect link: HTTP Error 502: Bad Gateway
Not a commercial instance - heartbeat set to 12 hours.
Launchable! -- Launching.
Changed job P36.J72 status launched
Running project UID P36 job UID J72
Running job on worker type node
Running job using: /opt/cryosparc-v2/cryosparc2_worker/bin/cryosparcw
---------- Scheduler finished ---------------
Changed job P36.J72 status started
Changed job P36.J72 status running
[JSONRPC ERROR 2021-03-08 11:29:40.493158 at cache_sync_hits ]
-----------------------------------------------------
Traceback (most recent call last):
File "/opt/cryosparc-v2/cryosparc2_master/cryosparc_command/command_core/__init__.py", line 115, in wrapper
res = func(*args, **kwargs)
File "/opt/cryosparc-v2/cryosparc2_master/cryosparc_command/command_core/__init__.py", line 2602, in cache_sync_hits
keys = com.decompress_paths(compressed_keys)
File "/opt/cryosparc-v2/cryosparc2_master/cryosparc_compute/jobs/common.py", line 577, in decompress_paths
return pickle.loads(zlib.decompress(base64.b64decode(paths)))
File "/opt/cryosparc-v2/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/base64.py", line 80, in b64decode
s = _bytes_from_decode_data(s)
File "/opt/cryosparc-v2/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/base64.py", line 46, in _bytes_from_decode_data
"string, not %r" % s.__class__.__name__) from None
TypeError: argument should be a bytes-like object or ASCII string, not 'list'
That’s odd, this still seems like the worker update failed.
Can you run the following command and paste its output here?
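For context, the master unpacks the cache keys it receives in decompress_paths (the last frames of your traceback), which expects a base64-encoded, zlib-compressed pickle. An out-of-date worker would still be passing the plain Python list, which matches the TypeError you're seeing. A rough illustration (compress_paths and the example keys below are just for demonstration, not the actual worker code):
import base64, pickle, zlib

def compress_paths(paths):
    # hypothetical counterpart of decompress_paths: what an updated worker
    # is expected to send -- a base64 string wrapping a zlib-compressed pickle
    return base64.b64encode(zlib.compress(pickle.dumps(paths))).decode()

def decompress_paths(paths):
    # mirrors the call shown in cryosparc_compute/jobs/common.py in your traceback
    return pickle.loads(zlib.decompress(base64.b64decode(paths)))

keys = ['J67/imported/particles_001.mrc', 'J67/imported/particles_002.mrc']  # example keys

# updated worker: the keys round-trip cleanly
print(decompress_paths(compress_paths(keys)))

# stale (pre-update) worker: the raw list goes straight to base64.b64decode,
# which raises the same TypeError seen in the command_core log
try:
    decompress_paths(keys)
except TypeError as e:
    print(e)  # argument should be a bytes-like object or ASCII string, not 'list'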
- Navigate to cryosparc_worker
- Run:
cryosparcw update --override
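For example, using the worker install path from your command_core log:
cd /opt/cryosparc-v2/cryosparc2_worker
bin/cryosparcw update --override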
Stephan,
It worked! While running the update on the worker node, I ran into some permissions errors. After fixing those, it's back in business.
Strangely enough, I didn't see any such errors while running the command on the master; the update process went through all the registered worker nodes without reporting anything. Perhaps these errors are not communicated back over SSH.
Thanks a lot for your help.