2D Classification Job Fails after 3.1 Update with invalid resource handle

Hello,

After updating our master and worker nodes to v3.1, I have been running into this issue. For testing, I am using the T20S dataset.

[CPU: 1.01 GB]   Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 1685, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 110, in cryosparc2_compute.engine.cuda_core.GPUThread.run
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 111, in cryosparc2_compute.engine.cuda_core.GPUThread.run
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 1011, in cryosparc2_compute.engine.engine.process.work
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 175, in cryosparc2_compute.engine.engine.EngineThread.setup_current_data_and_ctf
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_kernels.py", line 1732, in cryosparc2_compute.engine.cuda_kernels.extract_fourier_2D
  File "/cryosparc/worker/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/pycuda/driver.py", line 382, in function_call
    func._set_block_shape(*block)
LogicError: cuFuncSetBlockShape failed: invalid resource handle

All the nodes are updated to CUDA 11.2 with NVIDIA driver 460.xx.

I was able to fix this by downgrading to CUDA 10.
I will report back if there are more issues.

I like how flexible it is to switch CUDA versions via cryosparcw newcuda <cudapath>.
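
In case it helps anyone else hitting this, here is a quick way to check which CUDA toolkit the worker's pycuda was built against versus what the installed driver reports. This is just a minimal sketch; it assumes you run it with the worker's bundled Python (the deps/anaconda interpreter that shows up in the traceback above).

import pycuda.driver as cuda

cuda.init()
# CUDA toolkit version pycuda was compiled against, e.g. (10, 2, 0)
print("pycuda built against CUDA:", cuda.get_version())
# Highest CUDA version the installed NVIDIA driver supports, e.g. 11020 for 11.2
print("driver reports CUDA      :", cuda.get_driver_version())
print("device 0                 :", cuda.Device(0).name())

In our case the error went away after rebuilding against the CUDA 10 toolkit with newcuda.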

I ran into another problem, and I believe it’s related to the scratch/cache directory configuration.
All jobs fail with the error below when SSD caching is enabled.

I have tried removing the workers via the icli interface and adding them again, but they all stop with the same error.
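
For reference, this is roughly what I did from cryosparcm icli, which drops you into a Python shell with a cli client object (the remove call is the scheduler-target function from the cryoSPARC CLI docs; the hostname is ours and the scratch path below is just an example):

# remove the worker node from the scheduler
cli.remove_scheduler_target_node('shark.qb3.berkeley.edu')

# re-adding is then done from the worker side, for example:
#   bin/cryosparcw connect --master <master_hostname> --worker shark.qb3.berkeley.edu \
#       --ssdpath /path/to/cryosparc-scratch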

[CPU: 90.1 MB]   Project P36 Job J67 Started
[CPU: 90.1 MB]   Master running v3.1.0, worker running v3.1.0
[CPU: 90.3 MB]   Running on lane shark
[CPU: 90.3 MB]   Resources allocated: 
[CPU: 90.3 MB]     Worker:  shark.qb3.berkeley.edu
[CPU: 90.3 MB]     CPU   :  [0, 1]
[CPU: 90.4 MB]     GPU   :  [0]
[CPU: 90.4 MB]     RAM   :  [0, 1, 2]
[CPU: 90.4 MB]     SSD   :  True
[CPU: 90.4 MB]   --------------------------------------------------------------
[CPU: 90.4 MB]   Importing job module for job type class_2D...
[CPU: 405.8 MB]  Job ready to run
[CPU: 405.8 MB]  ***************************************************************
[CPU: 406.1 MB]  Using random seed of 1623841468
[CPU: 406.1 MB]  Loading a ParticleStack with 11 items...
[CPU: 406.1 MB]   SSD cache : cache successfuly synced in_use
[CPU: 406.1 MB]  Traceback (most recent call last):
  File "cryosparc2_worker/cryosparc2_compute/run.py", line 85, in cryosparc2_compute.run.main
  File "cryosparc2_worker/cryosparc2_compute/jobs/class2D/run.py", line 64, in cryosparc2_compute.jobs.class2D.run.run_class_2D
  File "cryosparc2_compute/particles.py", line 61, in read_blobs
    u_blob_paths = cache.download_and_return_cache_paths(u_rel_paths)
  File "cryosparc2_compute/jobs/cache.py", line 114, in download_and_return_cache_paths
    used_mb = sync_hits(worker_hostname, ssd_cache_path, instance_id)
  File "cryosparc2_compute/jobs/cache.py", line 191, in sync_hits
    rc.cli.cache_sync_hits(worker_hostname, keys, sizes_mb)
  File "cryosparc2_compute/client.py", line 57, in func
    assert False, res['error']
AssertionError: {u'message': u"OtherError: argument should be a bytes-like object or ASCII string, not 'list'", u'code': 500, u'data': None, u'name': u'OtherError'}

The permissions on the scratch directory look fine too:

drwxrwxr-x 3 cryosparc nogales-current 4096 Mar 5 18:00 cryosparc-scratch

Any ideas?

Hi @achintangal,

It seems like the cryoSPARC update/installation somehow failed.
Please re-install cryoSPARC with:
cryosparcm update --override

@Stephan,

I ran the update process again and I am still running into the same error. Here is the log from command_core:

Jobs Queued:  [('P36', 'J72')]
Licenses currently active : 1
Now trying to schedule J72
  Need slots :  {'CPU': 2, 'GPU': 1, 'RAM': 3}
  Need fixed :  {'SSD': True}
  Master direct :  False
   Scheduling job to albakor.x.x.
Failed to connect link: HTTP Error 502: Bad Gateway
Not a commercial instance - heartbeat set to 12 hours.
     Launchable! -- Launching.
Changed job P36.J72 status launched
      Running project UID P36 job UID J72 
        Running job on worker type node
        Running job using:  /opt/cryosparc-v2/cryosparc2_worker/bin/cryosparcw
---------- Scheduler finished --------------- 
Changed job P36.J72 status started
Changed job P36.J72 status running
[JSONRPC ERROR  2021-03-08 11:29:40.493158  at  cache_sync_hits ]
-----------------------------------------------------
Traceback (most recent call last):
  File "/opt/cryosparc-v2/cryosparc2_master/cryosparc_command/command_core/__init__.py", line 115, in wrapper
    res = func(*args, **kwargs)
  File "/opt/cryosparc-v2/cryosparc2_master/cryosparc_command/command_core/__init__.py", line 2602, in cache_sync_hits
    keys = com.decompress_paths(compressed_keys)
  File "/opt/cryosparc-v2/cryosparc2_master/cryosparc_compute/jobs/common.py", line 577, in decompress_paths
    return pickle.loads(zlib.decompress(base64.b64decode(paths)))
  File "/opt/cryosparc-v2/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/base64.py", line 80, in b64decode
    s = _bytes_from_decode_data(s)
  File "/opt/cryosparc-v2/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/base64.py", line 46, in _bytes_from_decode_data
    "string, not %r" % s.__class__.__name__) from None
TypeError: argument should be a bytes-like object or ASCII string, not 'list'

That’s odd; this still seems like the worker update failed.
Can you run the following command and paste its output here?

  1. Navigate to cryosparc_worker
  2. Run: cryosparcw update --override
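
For what it's worth, the TypeError is consistent with that: the master's decompress_paths (line 577 of common.py in your traceback) expects the cache keys as a base64-encoded, zlib-compressed pickle, whereas a worker still running old code passes them along as a plain Python list. A rough sketch of that round-trip, where compress_paths is my reconstruction for illustration rather than the actual cryoSPARC source:

import base64, pickle, zlib

def compress_paths(paths):
    # worker side: list of cache keys -> pickle -> zlib -> base64 ASCII string
    return base64.b64encode(zlib.compress(pickle.dumps(paths))).decode('ascii')

def decompress_paths(blob):
    # master side: mirrors the call shown in the traceback above
    return pickle.loads(zlib.decompress(base64.b64decode(blob)))

keys = ['path/to/some_particles.mrc']          # placeholder cache key
print(decompress_paths(compress_paths(keys)))  # round-trips cleanly

try:
    decompress_paths(keys)                     # a stale worker hands over the raw list instead
except TypeError as err:
    print(err)  # argument should be a bytes-like object or ASCII string, not 'list'

So once the worker is actually updated to v3.1, the two sides should agree again.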

Stephan,

It worked! While running the update on the worker node, I ran into some permission errors. After fixing those, it’s back in business.

Strangely enough, I didn’t see any such errors while running the update from the master; it went through all the registered worker nodes without reporting any problems. Perhaps these errors are not relayed back over SSH.

Thanks a lot for your help.
