Add new data to existing project

tarek · September 13, 2018, 6:41pm

Hi,
I was wondering if it is possible to simply add new micrographs to an existing project and only process the new files, using the previous steps as a template, similar to the RELION-type “pipeline approach”.
I.e., if I import a whole micrograph folder during data collection, its content will certainly increase with time.
For “on-the-fly” processing it would be nice not to have to re-estimate all defoci, re-pick all particles, re-classify and so on…
At the moment I am simply cloning the jobs from day 1 for the next day. Obviously, all the images that were processed before will be processed again, as cryoSPARC doesn’t recognize them as “completed”.
Any suggestions?

Best,
Tarek

DanielAsarnow · September 13, 2018, 7:17pm

Import the new group of micrographs, then replace the input on the cloned job with these new micrographs for CTF, templates, etc. When you finally use the particles, add the particles from all the micrograph groups as inputs to the refinement job. After that, you can use the particles from the refinement/ab initio/2D job as the input for later steps, so you won’t have to drag-and-drop multiple things anymore.

tarek · September 13, 2018, 7:56pm

@DanielAsarnow The point is: how can I add only the new mics?
If I select the session folder, the old ones will be imported again. The idea is to import directly during data collection, to get an immediate impression and processing of the data. In the end, batches of selected particles should be combined for final calculations.

DanielAsarnow · September 13, 2018, 8:10pm

Oh, you can just use a wildcard instead of importing the whole directory. Click the first micrograph, then replace foo_0001_bar.mrc with foo_01*_bar.mrc, to get the first 100 (or 101 if you started at 0) micrographs. I dunno if you can shift-click to select ranges in the file browser pop up, but it seems like a good feature to add.

tarek · September 13, 2018, 8:55pm

That’s one way; however, it requires the filenames to be consecutive, or am I wrong?
Wouldn’t it be easy to add a flag for “already processed” in the database?

What would be really nice is to define a scheme, e.g. in the tree view and just feed this template scheme (including pre-defined settings for ctf estimation, template picking, ncc, box size, extraction etc.) with new input images.

DanielAsarnow · September 13, 2018, 9:47pm

All the files matching the wildcard would be used, consecutive or no.

tarek · September 14, 2018, 8:13am

Maybe it’s obvious, but I don’t get it. I apologize.
Here is an example.

20180912_2256_12169094_MC2_DW.mrc
20180912_2257_12169095_MC2_DW.mrc
20180912_2259_12169100_MC2_DW.mrc
20180912_2301_12169101_MC2_DW.mrc
20180912_2302_12169102_MC2_DW.mrc
20180912_2306_12169104_MC2_DW.mrc
20180912_2307_12169105_MC2_DW.mrc
20180912_2309_12169106_MC2_DW.mrc
20180912_2311_12169107_MC2_DW.mrc
20180912_2313_12169108_MC2_DW.mrc
20180912_2315_12169109_MC2_DW.mrc
20180912_2316_12169110_MC2_DW.mrc
20180912_2318_12169111_MC2_DW.mrc
20180912_2319_12169112_MC2_DW.mrc
20180912_2320_12169113_MC2_DW.mrc
20180912_2322_12169114_MC2_DW.mrc
20180912_2324_12169132_MC2_DW.mrc
20180912_2326_12169133_MC2_DW.mrc
20180912_2328_12169134_MC2_DW.mrc
20180912_2330_12169136_MC2_DW.mrc
20180912_2332_12169137_MC2_DW.mrc
20180912_2334_12169138_MC2_DW.mrc
20180912_2335_12169139_MC2_DW.mrc
20180912_2337_12169148_MC2_DW.mrc
20180912_2339_12169149_MC2_DW.mrc
20180912_2341_12169150_MC2_DW.mrc
20180912_2342_12169151_MC2_DW.mrc

How would you add all micrographs created after 20180912_2306_12169104_MC2_DW.mrc using wildcards?
Let’s assume there are 500 images before that were already processed and 500 after.
With bash it’s easy but from within chrome I have no idea how to add all at once.

DanielAsarnow · September 14, 2018, 9:54pm

The file browser is part of cryoSPARC, it’s not the native one from your operating system. I assume it processes the string using glob.glob from the Python standard library (or something similar) - my quick check appears to confirm.

Thus, for example, micrographs/stack_0[4-5]*.mrc will choose all the micrographs from 400 to 599. Indeed, arbitrary numeric ranges aren’t supported, but I don’t think it really matters if a few micrographs are ever reprocessed, like in your example where you had previously imported all of them at first and not fixed ranges from the beginning. If you want to avoid that in the future you could always use a predictable grouping like 0[0-4]*, 0[5-9]*, 1[0-4]*, etc. while only running these imports after that many micrographs have been recorded.
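As a rough illustration of how such bracket patterns select files, here is a minimal sketch using `fnmatch.fnmatchcase` from the Python standard library, which follows the same shell-style matching rules as `glob`. The `stack_NNNN.mrc` names with zero-padded four-digit indices are hypothetical, and whether cryoSPARC's file browser really uses `glob` internally remains the assumption stated above.

```python
from fnmatch import fnmatchcase

# Hypothetical file names with zero-padded four-digit indices, 0000 to 0999.
names = [f"stack_{i:04d}.mrc" for i in range(1000)]


def matching(pattern):
    """Return the names that match a shell-style wildcard pattern."""
    return [n for n in names if fnmatchcase(n, pattern)]


# [4-5] matches a single character '4' or '5', so this selects 0400 to 0599.
picked = matching("stack_0[4-5]*.mrc")
print(picked[0], picked[-1], len(picked))

# Predictable, non-overlapping groups of 500 micrographs each.
groups = ["stack_0[0-4]*.mrc", "stack_0[5-9]*.mrc"]
batches = [matching(g) for g in groups]
```

With this numbering, each group pattern covers a fixed block of 500 indices, so importing one pattern at a time never picks up a micrograph twice.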

I imagine eventually they’ll add a true incremental on-the-fly feature, by my guess as its own job type, but until then this is probably the only way.

tarek · September 16, 2018, 6:53pm

Thank you Daniel.
That answered my question completely. With your suggestions, grouping can indeed be done in just a few steps.
However, I appreciate the --do_unfinished option or its equivalents used by the MRC-born tools. I would love to have something similar here.
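Until something like that exists, the selection of "everything after the last processed micrograph" can be done outside the browser. The following is only a sketch, not a cryoSPARC feature: the helper name `unprocessed` is made up for illustration, and it assumes the file names start with a sortable timestamp (as in the example listing earlier in the thread), so that alphabetical order equals acquisition order.

```python
def unprocessed(all_names, last_done):
    """Return the names that sort strictly after last_done.

    Assumes names begin with a YYYYMMDD_HHMM timestamp, so plain string
    comparison orders them chronologically.
    """
    return sorted(n for n in all_names if n > last_done)


# A few names taken from the example listing in this thread.
session = [
    "20180912_2302_12169102_MC2_DW.mrc",
    "20180912_2306_12169104_MC2_DW.mrc",
    "20180912_2307_12169105_MC2_DW.mrc",
    "20180912_2309_12169106_MC2_DW.mrc",
]

new_files = unprocessed(session, "20180912_2306_12169104_MC2_DW.mrc")
print(new_files)
```

One possible workflow, again as an assumption rather than a tested recipe, is to symlink the returned files into a separate per-batch folder and import that folder, so that each import sees only the new micrographs.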