Extensive workflow - Import movies fails

Dear all,

I just tried to run the Extensive Workflow as suggested at:
https://cryosparc.com/docs/tutorials/extensive-workflow
The installed cryoSPARC version is 2.14.2. I did not change any of the default settings. The worker and master are on the same workstation, running Ubuntu.

I started the workflow and it fails after a couple of seconds at the first job, Import Movies.
The error is:

    [CPU: 195.2 MB]  Importing movies from /bulk0/data/EMPIAR/10025/data/empiar_10025_subset/*.tif
    [CPU: 195.2 MB]  Traceback (most recent call last):
    File "cryosparc2_master/cryosparc2_compute/run.py", line 82, in cryosparc2_compute.run.main
    File "cryosparc2_compute/jobs/imports/run.py", line 463, in run_import_movies_or_micrographs
    assert len(all_abs_paths) > 0, "No files match!"
    AssertionError: No files match!

It should download some data automatically but that does not seem to happen.
Any ideas what could be the issue?
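For reference, here is a quick way to check whether anything matches the default glob at all (the path is copied from the job output above and may differ on other installs):

    # List whatever matches the Import job's default movie glob;
    # an empty result (or "No such file or directory") corresponds to the
    # "No files match!" assertion above.
    ls -l /bulk0/data/EMPIAR/10025/data/empiar_10025_subset/*.tif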

If you require any additional logs or something I am happy to provide those.

Thank you!
Matic

Hey @eMKiso,

I believe you can get the job to download the movies automatically by deleting the text in the fields for both paths. Can you let me know if that works?

Edit: The automatic download will be available within a few days in the upcoming version of cryoSPARC, my apologies for the confusion.

In the meantime, you can download and extract the test data manually as per Step 3 here:

Then be sure to specify the movies and gain reference paths in the Extensive Workflow job description.
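Roughly, the manual route could look like the sketch below; the destination directory matches the default Import path from earlier in the thread, and the archive name and download behaviour are assumptions, so adjust both for your setup:

    # Fetch the 20-movie T20S test subset and unpack it where the default
    # Import path expects it (directory and archive name are assumptions).
    cd /bulk0/data/EMPIAR/10025/data
    cryosparcm downloadtest            # or download the archive linked in the tutorial
    tar -xf empiar_10025_subset.tar    # assumed archive name
    ls empiar_10025_subset/*.tif       # the movies to point the Import step at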

Hi,

Yeah, the empty paths didn't work; there was a different error.

I actually did download some data an hour ago and ran the workflow with those files. I was not sure exactly which file to download, so I picked ftp://ftp.ebi.ac.uk/empiar/world_availability/10025/data/14sep05c_aligned_196/14sep05c_c_00003gr_00014sq_00002hl_00005es_st.mrc, since I could not find any .tif files like the ones suggested by the default path.

3 jobs finish successfully:

  1. Import
  2. Full-frame motion (M)
  3. CTF Estimation (CTFFIND4)

At the fourth job (Manually Curate Exposures) the workflow fails. The interactive job starts, but at this point the 'master job / workflow' fails with this error:

[CPU: 94.9 MB] Traceback (most recent call last):
File "cryosparc2_master/cryosparc2_compute/run.py", line 82, in cryosparc2_compute.run.main
File "cryosparc2_compute/jobs/workflows/buildrun_bench.py", line 174, in run_extensive_workflow
assert counts['total'] == 20
AssertionError

It fails before I get the chance to even load/open the interactive job.

I am now downloading the archive with the cryosparcm downloadtest command and I’ll try with that.

OK, I tried now with the data downloaded by cryoSPARC and it went successfully past 'Manually Curate Exposures'. So the error above was connected to the fact that the data I had was not what the workflow expected.

It seems to be running fine now. I will report if there are any additional errors.
If not, that's it; thanks very much @stephan for the quick information!

Best,
Matic

Hey @eMKiso,

Thanks for updating! Yes, the Extensive Workflow job only works with the movies from the cryosparcm downloadtest dataset (which are a 20-movie subset of EMPIAR-10025 converted to .tif files).
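As a quick sanity check that the dataset on disk is the one the workflow expects, you can count the movies; the assertion that failed earlier (counts['total'] == 20) expects exactly 20 exposures. The path below is illustrative:

    # Count the .tif movies in the extracted test subset; the Extensive
    # Workflow's benchmark assertions assume exactly 20 of them.
    # (Path is illustrative; use wherever you extracted the subset.)
    ls /bulk0/data/EMPIAR/10025/data/empiar_10025_subset/*.tif | wc -l    # should print 20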

Hi all,
So, I am back to this after an upgrade to 2.15.
The Extensive Workflow now works great on a single workstation. Good work!

Now we are facing a different problem when running Extensive Workflow for T20S (BENCH) (BETA) on a cluster.
It works fine until the 'Extract From Micrographs' step, where it fails: the job just hangs with no errors and stays like that for hours. The log from the interface is here:

Previous steps that use GPUs seem to work fine.
If I clone the 'Extract From Micrographs' job created during the Extensive Workflow and run it with the number of GPUs set to 0, it finishes without any errors.

The GPUs on the cluster are NVIDIA Tesla V100 (driver 435.21, CUDA 10.1).

Any ideas?

Hi @eMKiso, can you share the output of the internal job log for the Extract job? You can get this from the command line with this command:

cryosparcm joblog PXX JXX

Substitute PXX and JXX with the project and job numbers for the Extract job. Paste the full output here. You can press Control + C to stop the joblog.

Hi,

Sure, below you can find the log. It is from the same job as the output in my previous post. I killed the job manually after some time.

[cryosparc@rm]$ cryosparcm joblog P21 J49

========= CRYOSPARCW =======  2020-10-07 10:22:33.070194  =========
Project P21 Job J49
Master rm Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 23756
========= monitor process now waiting for main process
MAIN PID 23756
extract.run cryosparc2_compute.jobs.jobregister
***************************************************************
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
[... "========= sending heartbeat" repeats with no further output until the job is killed ...]
^C

Hi @eMKiso, thanks for sending that. Based on this, it definitely looks like the worker is getting stuck when it attempts to access the GPU. Could you send the job submission script in the Extract Job directory? It should be in the project directory at PXX/JXX/queue_sub_script.sh
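For example, from the master node (the leading path below is just a placeholder; P21/J49 are the project and job numbers from this thread):

    # Print the cluster submission script cryoSPARC generated for the Extract job.
    # Replace /path/to/projects with the actual location of your P21 project directory.
    cat /path/to/projects/P21/J49/queue_sub_script.sh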

Can you also send a screenshot of the cryoSPARC web interface with the full workspace open, showing the hanging Extract job?

Hi @nfrasser, sorry for the late reply.
Here is queue_sub_script.sh:

#!/bin/bash
#SBATCH --job-name cryosparc_P21_J49
#SBATCH -c 2
#SBATCH --gres=gpu:1
#SBATCH -p grid
#SBATCH --reservation=KI
#SBATCH -t 1-00
#SBATCH --mem=8192
mkdir -p /data1/cryosparc/cryosparc_cache
export MKL_DEBUG_CPU_TYPE=5
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PATH=/bin:/usr/bin:/usr/local/bin:/usr/local/cuda/bin:$PATH
singularity exec --nv /ceph/sys/singularity/cryosparc_worker.sif /opt/cryosparc2_worker/bin/cryosparcw run --project P21 --job J49 --master_hostname rm --master_command_core_port 39002 > /ceph/grid/data/cs_extensive_workflow/P21/J49/job.log 2>&1

[Screenshot from 2020-10-18 13-15-21 attached]

I hope it helps!

Hi @eMKiso, I’ve brought this up with the rest of the cryoSPARC team and we’re stumped. We’ve never seen this kind of issue with the extensive workflow, even on cluster systems.

I suggest the following:

  • Check your cluster configuration to see if there are any bugs in GPU resource allocation; a quick check of GPU visibility inside the worker container is sketched below.
  • When the Extract job runs, quickly kill and clear it, change the parameters to use 0 GPUs, and queue it on the same lane. If all goes well, the extensive workflow should register the completed job and move on. Afterward, check whether any subsequent jobs have the same issue.
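For the first point, a minimal sketch of such a check, reusing the partition, reservation, and Singularity image from the submission script you posted (adjust as needed for your cluster):

    # Request one GPU on the same SLURM partition/reservation as the cryoSPARC job
    # and confirm it is actually visible inside the worker container.
    srun -p grid --reservation=KI --gres=gpu:1 \
        singularity exec --nv /ceph/sys/singularity/cryosparc_worker.sif nvidia-smi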

Sorry I can’t be of more help.

Nick

Hi,

thank you for all the effort!
I tried clearing the job, but how do I then change the parameters in the cleared job?

To go into Builder mode, kill and clear the job. Then open the workspace and click on the “Building” badge at the centre of the job (or you can select the job card and press the B key on your keyboard).

Oh great, I didn't know one could select the "Building" badge.

Some limited success: now it failed at the next "Extract From Micrographs" job.


A bit more descriptive this time:

[CPU: 184.6 MB]  Starting multithreaded pipeline ... 
[CPU: 184.8 MB]  Started pipeline
[CPU: 285.4 MB]  GPU 0 using a batch size of 1024
[CPU: 285.6 MB]  -- 0.0: processing J61/motioncorrected/13054666239615727002_14sep05c_00024sq_00003hl_00002es.frames_patch_aligned_doseweighted.mrc
        Writing to /cs_extensive_workflow/P21/J71/extract/13054666239615727002_14sep05c_00024sq_00003hl_00002es.frames_patch_aligned_doseweighted_particles.mrc
[CPU: 285.6 MB]  -- 0.1: processing J61/motioncorrected/16675970042098428134_14sep05c_00024sq_00003hl_00005es.frames_patch_aligned_doseweighted.mrc
[CPU: 876.6 MB]  -- 0.0: processing J61/motioncorrected/2465388814133724455_14sep05c_00024sq_00004hl_00002es.frames_patch_aligned_doseweighted.mrc
[CPU: 877.2 MB]  Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 1685, in run_with_except_hook
    run_old(*args, **kw)
  File "/opt/cryosparc2_worker/deps/anaconda/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "cryosparc2_compute/jobs/pipeline.py", line 65, in stage_target
    work = processor.process(item)
  File "cryosparc2_compute/jobs/extract/run.py", line 238, in process
    _, mrcdata = mrc.read_mrc(path_abs, return_psize=True)
  File "cryosparc2_compute/blobio/mrc.py", line 135, in read_mrc
    data = read_mrc_data(file_obj, header, start_page, end_page, out)
  File "cryosparc2_compute/blobio/mrc.py", line 98, in read_mrc_data
    data = n.fromfile(file_obj, dtype=dtype, count= num_pages * ny * nx).reshape(num_pages, ny, nx)
MemoryError
[CPU: 877.2 MB]  Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 1685, in run_with_except_hook
    run_old(*args, **kw)
  File "/opt/cryosparc2_worker/deps/anaconda/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "cryosparc2_compute/jobs/pipeline.py", line 65, in stage_target
    work = processor.process(item)
  File "cryosparc2_compute/jobs/extract/run.py", line 271, in process
    cuda_dev = self.cuda_dev, ET=self.ET, timer=timer, batch_size = self.batch_size)
  File "cryosparc2_compute/jobs/extract/extraction_gpu.py", line 176, in do_extract_particles_single_mic_gpu
    output_g[batch_start:batch_end] = ET.output_gpu.get()[:curr_batch_size]
  File "/opt/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/pycuda/gpuarray.py", line 287, in get
    ary = np.empty(self.shape, self.dtype)
MemoryError
[CPU: 1.06 GB]   (1 of 20) Finished processing micrograph 0.

Do you need any additional logs?
Thank you a lot!

I actually had to kill this job manually on the cluster, even though it already showed as Failed in the user interface.
Then I changed the number of GPUs to 0 and it finished just fine.

Hi @eMKiso, apologies for the delay. Thanks for sending this over; it's been very helpful. We've been seeing a number of similar issues specifically with the Extract job and are currently in the middle of investigating. Will update when I know more!

Hi, we also recently noticed that Extract jobs often fail when running on the GPU, not only during the Extensive Workflow.
In case I can help in any way, please let me know.
Thank you!