V3.0 SSD option bug

A Non-uniform Refinement job tries to use the SSD cache even though SSD caching is disabled for the job, and then crashes after trying to load the particle stack.

See below…

[CPU: 83.6 MB]   --------------------------------------------------------------

[CPU: 83.6 MB]   Importing job module for job type nonuniform_refine_new...

[CPU: 692.0 MB]   SSD cache : cache does not have enough space for download

[CPU: 692.0 MB]   SSD cache :   but there are no files that can be deleted. 

[CPU: 692.0 MB]   SSD cache :   This could be because other jobs are running and using files, or because a different program has used up space on the SSD. 

[CPU: 692.0 MB]   SSD cache :   Waiting 30 seconds for space to become available... 

[CPU: 499.9 MB]  Job ready to run

[CPU: 500.0 MB]  ***************************************************************

[CPU: 685.1 MB]  Using random seed of 534070744

[CPU: 685.1 MB]  Loading a ParticleStack with 135638 items...

[CPU: 692.0 MB]   SSD cache : cache does not have enough space for download

[CPU: 692.0 MB]   SSD cache :   but there are no files that can be deleted. 

[CPU: 692.0 MB]   SSD cache :   This could be because other jobs are running and using files, or because a different program has used up space on the SSD. 

[CPU: 692.0 MB]   SSD cache :   Waiting 30 seconds for space to become available... 

[CPU: 692.0 MB]   SSD cache : cache does not have enough space for download

[CPU: 692.0 MB]   SSD cache :   but there are no files that can be deleted. 

[CPU: 692.0 MB]   SSD cache :   This could be because other jobs are running and using files, or because a different program has used up space on the SSD. 

[CPU: 692.0 MB]   SSD cache :   Waiting 30 seconds for space to become available... 

[CPU: 692.0 MB]   SSD cache : cache does not have enough space for download

[CPU: 692.0 MB]   SSD cache :   but there are no files that can be deleted. 

[CPU: 692.0 MB]   SSD cache :   This could be because other jobs are running and using files, or because a different program has used up space on the SSD. 

[CPU: 692.0 MB]   SSD cache :   Waiting 30 seconds for space to become available... 

[CPU: 752.9 MB]    Done.

[CPU: 752.9 MB]  Windowing particles

[CPU: 753.1 MB]    Done.

[CPU: 753.1 MB]  ====== Gold Standard Split ======

[CPU: 753.1 MB]    Force re-split for gold-standard split is enabled, so particles will be randomly split into two halves.

[CPU: 769.8 MB]    Split A has 67819 particles 

[CPU: 769.8 MB]    Split B has 67819 particles 

[CPU: 769.8 MB]  ====== Refinement ======

[CPU: 769.8 MB]    Input particles have box size 270

[CPU: 769.8 MB]    Input particles have pixel size 1.0780

[CPU: 769.8 MB]    Particles will be zeropadded/truncated to size 270 during alignment

[CPU: 769.8 MB]    Volume refinement will be done with effective box size 270

[CPU: 769.8 MB]    Volume refinement will be done with pixel size 1.0780

[CPU: 769.8 MB]    Particles will be zeropadded/truncated to size 270 during backprojection

[CPU: 769.8 MB]    Particles will be backprojected with box size 270

[CPU: 769.8 MB]    Volume will be internally cropped and stored with box size 270

[CPU: 769.8 MB]    Volume will be interpolated with box size 270 (zeropadding factor 1.00)

[CPU: 769.8 MB]    DC components of images will be ignored and volume will be floated at each iteration.

[CPU: 769.8 MB]    Spherical windowing of maps is enabled

[CPU: 769.8 MB]    Refining with C1 symmetry enforced

[CPU: 770.0 MB]    Resetting input per-particle scale factors to 1.0

[CPU: 770.0 MB]    Starting at initial resolution 30.000A (radwn 9.702). 

[CPU: 770.0 MB]  ====== Masking ======

[CPU: 1.06 GB]     No mask input was connected, so dynamic masking will be enabled. 

[CPU: 1.06 GB]     Dynamic mask threshold: 0.2000 

[CPU: 1.06 GB]     Dynamic mask near (A): 6.00 

[CPU: 1.06 GB]     Dynamic mask far  (A): 14.00 

[CPU: 1.06 GB]   ====== Initial Model ======

[CPU: 1.06 GB]     Resampling initial model to specified volume representation size and pixel-size...

[CPU: 1.21 GB]     Estimating scale of initial reference. 

[CPU: 1.75 GB]   Traceback (most recent call last):
  File "/home/cryosparc_user/V3.X/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1711, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 129, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 130, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 997, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 80, in cryosparc_compute.engine.engine.EngineThread.load_image_data_gpu
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 331, in cryosparc_compute.engine.cuda_core.EngineBaseThread.ensure_allocated
  File "/home/cryosparc_user/V3.X/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/gpuarray.py", line 210, in __init__
    self.gpudata = self.allocator(self.size * self.dtype.itemsize)
pycuda._driver.MemoryError: cuMemAlloc failed: out of memory


[CPU: 692.0 MB]   SSD cache : cache does not have enough space for download

[CPU: 692.0 MB]   SSD cache :   but there are no files that can be deleted. 

[CPU: 692.0 MB]   SSD cache :   This could be because other jobs are running and using files, or because a different program has used up space on the SSD. 

[CPU: 692.0 MB]   SSD cache :   Waiting 30 seconds for space to become available... 

[CPU: 692.0 MB]   SSD cache : cache does not have enough space for download

[CPU: 692.0 MB]   SSD cache :   but there are no files that can be deleted. 

[CPU: 692.0 MB]   SSD cache :   This could be because other jobs are running and using files, or because a different program has used up space on the SSD. 

[CPU: 692.0 MB]   SSD cache :   Waiting 30 seconds for space to become available... 

[CPU: 692.0 MB]   SSD cache : cache does not have enough space for download

[CPU: 692.0 MB]   SSD cache :   but there are no files that can be deleted. 

[CPU: 692.0 MB]   SSD cache :   This could be because other jobs are running and using files, or because a different program has used up space on the SSD. 

[CPU: 692.0 MB]   SSD cache :   Waiting 30 seconds for space to become available... 

[CPU: 692.0 MB]   SSD cache : cache does not have enough space for download

[CPU: 692.0 MB]   SSD cache :   but there are no files that can be deleted. 

[CPU: 692.0 MB]   SSD cache :   This could be because other jobs are running and using files, or because a different program has used up space on the SSD.
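
Since SSD caching was turned off for this job, I would expect the cache step to be skipped entirely rather than ever hitting the wait loop above. Purely for illustration (a hypothetical Python sketch, not cryoSPARC's actual code), something like the following guard on the scheduler's per-job SSD flag:

    # Hypothetical sketch only -- not cryoSPARC's real internals.
    # The worker log prints an "Allocated Resources" dict that includes
    # {'fixed': {'SSD': False}, ...}; the cache step could be guarded on it.

    def resolve_particle_paths(alloc, particle_files, cache_dir="/tmp"):
        """Return the file paths the engine should read, skipping the SSD
        cache entirely when the job was allocated with SSD disabled."""
        use_ssd = alloc.get("fixed", {}).get("SSD", False)
        if not use_ssd:
            # SSD caching is off for this job: read straight from project
            # storage and never enter the "waiting for space" loop.
            return list(particle_files)
        # SSD caching is on: stage files onto the cache first (the space
        # accounting and the actual copy are elided in this sketch).
        return [cache_dir + "/" + p.split("/")[-1] for p in particle_files]

    # With the allocation this job was given ({'fixed': {'SSD': False}}):
    print(resolve_particle_paths({"fixed": {"SSD": False}}, ["extract/particles.mrcs"]))
    # -> ['extract/particles.mrcs']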

Hi @MHB,

Thanks for reporting. Can you send us your job log? You can get it with: cryosparcm joblog <project_uid> <job_uid>
Can you also let us know your volume box size & particle size?
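(For the job log, a hypothetical example would be cryosparcm joblog P3 J12; substitute your own project and job UIDs.)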

Hi Stephen,

Box size is 270, pixel size 0.801.
Particle size is ~160 Å.

Logfile-------
================= CRYOSPARCW ======= 2020-12-10 00:43:38.470213 =========
Project P19 Job J358
Master lomatia.colorado.edu Port 39002

========= monitor process now starting main process
MAINPROCESS PID 5471
========= monitor process now waiting for main process
MAIN PID 5471
refine.newrun cryosparc_compute.jobs.jobregister
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat


Running job J358 of type nonuniform_refine_new
Running job on hostname %s lomatia.colorado.edu
Allocated Resources : {'fixed': {'SSD': False}, 'hostname': 'lomatia.colorado.edu', 'lane': 'default', 'lane_type': 'default', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3], 'GPU': [3], 'RAM': [0, 1, 2]}, 'target': {'cache_path': '/tmp', 'cache_quota_mb': 200, 'cache_reserve_mb': 237, 'desc': None, 'gpus': [{'id': 0, 'mem': 8513585152, 'name': 'GeForce GTX 1080'}, {'id': 1, 'mem': 8513978368, 'name': 'GeForce GTX 1080'}, {'id': 2, 'mem': 8513978368, 'name': 'GeForce GTX 1080'}, {'id': 3, 'mem': 8513978368, 'name': 'GeForce GTX 1080'}], 'hostname': 'lomatia.colorado.edu', 'lane': 'default', 'monitor_port': None, 'name': 'lomatia.colorado.edu', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, 'ssh_str': 'cryosparc_user@lomatia.colorado.edu', 'title': 'Worker node lomatia.colorado.edu', 'type': 'node', 'worker_bin_path': '/home/cryosparc_user/V3.X/cryosparc_worker/bin/cryosparcw'}}
**custom thread exception hook caught something
**** handle exception rc
set status to failed
========= main process now complete.
========= monitor process now complete.

Hi @MHB, can you do the following:

restart cryoSPARC, checking for zombie processes:

  1. cryosparcm stop

  2. check ps -ax | grep cryosparc to ensure that no cryoSPARC processes are lingering (see the note after these steps)

  3. cryosparcm start

clone the job that failed

run the cloned job again
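A note on step 2: if ps -ax | grep cryosparc still lists cryoSPARC processes after cryosparcm stop, end them manually before running cryosparcm start (e.g. kill <PID> for each lingering process, using the PIDs shown in the ps output).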

From the logs, it seems that somehow the job had started twice… once with the correct params and once with default params. We are not sure how this happened!


Thanks… restarted and it is now working. Not sure how this happened, but all OK now.
