Hello, we are processing EER micrographs with pixel size 0.58 Å, binned twice, on CryoSPARC using Patch Motion Correction, and 2600 images took almost 6 hours to complete. Is that normal, given that RELION can do the motion correction of 1000 movies in less than 1 hour? Are we running it wrong? What is a normal speed?
EER upsampling=2 at import, perhaps? This does slow things down quite a bit relative to 4K rendering, which RELION defaults to.
Cheers,
Yang
Depends a lot on where the micrographs are, although for high mag EER data file sizes should be small enough not to be a problem.
My guess is, as @leetleyang also hypothesised, that you left EER upsampling at 2. This will output 8K movies, which in itself isn’t the issue… the bigger issue is that CryoSPARC struggles with CTF estimation at very small pixel sizes, so chances are Patch CTF Estimation will fail.
It’s interesting, because I usually find RELION motion correction (using “RELIONCor”) is significantly slower than Patch Motion Correction of EER in CryoSPARC…
I am not upsampling. We did Patch Motion on 9000 images in 20 hours, pixel size 0.58, 45 e/Å dose.
Is that normal, when I get 1000 movies/hour in RELION?
I see.
Can you confirm the number of GPUs you’re parallelising Patch Motion Correction across, and in comparison, the MPI/thread values, nodes, and cores-per-node in your RELION scenario?
Cheers,
Yang
I am using 1 GPU for the patch motion, and for RELION I am using 32 MPI/20 threads, no GPUs.
Would using more CPUs make the jobs in CryoSPARC faster?
Thank you. In that case, the throughput you’re seeing is quite normal, on both fronts.
Your RELION parameters instruct RelionCor to process 32 movies in parallel (each using 20 threads). What CPU processing loses in raw compute, it makes up for in multithreading and parallelisation. I assume the MPI ranks are spread across several nodes in your cluster, which alleviates network I/O bottlenecks.
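For reference, a RelionCor run along those lines might look roughly like the sketch below. Flag names are from memory and every value is a placeholder, so check relion_run_motioncorr_mpi --help and your own import settings before copying anything:

# Sketch of RELION's own motion correction with 32 MPI ranks x 20 threads.
# All values below are illustrative placeholders; match them to your data.
mpirun -n 32 relion_run_motioncorr_mpi \
    --i Import/job001/movies.star \
    --o MotionCorr/job002/ \
    --use_own \
    --j 20 \
    --angpix 0.58 \
    --bin_factor 2 \
    --eer_grouping 32 \
    --dose_weighting \
    --dose_per_frame 1.0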
On the cryoSPARC front, 9000 EER movie stacks over 20 hours works out to 7.5 stacks/min, which feels normal to me given the 1-GPU assignment. The good news is that this will scale somewhat linearly with the number of GPUs you assign to the task, within reason. Try increasing Number of GPUs to parallelize to 2 or 4 (depending on your workstation/node configuration) and see if that helps.
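As a rough back-of-the-envelope check of what that scaling would buy you (a sketch only, assuming near-linear scaling and no filesystem bottleneck):

# Back-of-the-envelope throughput estimate; assumes near-linear GPU scaling
# and no I/O bottleneck, so treat the projections as optimistic.
movies=9000; hours=20
echo "scale=1; $movies / ($hours * 60)" | bc   # ~7.5 movies/min on 1 GPU
echo "scale=1; $hours / 2" | bc                # ~10 h projected with 2 GPUs
echo "scale=1; $hours / 4" | bc                # ~5 h projected with 4 GPUs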
Cheers,
Yang
We have a limited number of tokens and cannot really afford to use 4 GPUs. Is there any way to accelerate the process by increasing the threads in the CryoSPARC jobs and not making it so reliant on GPUs?
Unfortunately not. All compute-heavy tasks in cryoSPARC are, by design, GPU-accelerated.
You can limit downtime by preprocessing movies in cryoSPARC Live as they’re being generated/transferred. A modern GPU would be able to keep pace with typical collection throughputs.
Another option is to perform motion correction in RELION and import the averaged micrographs into cryoSPARC for downstream processing. I believe the community has generally found Patch Motion Correction to do a better job, for whatever reason, but RelionCor is perfectly workable. However, this workflow will preclude Reference-Based Motion Correction in cryoSPARC. As Sanofi seems to have access to a lot of CPU resources, Bayesian polishing in RELION would be a viable alternative.
Cheers,
Yang
Confirming the post from @leetleyang, CryoSPARC’s Patch motion correction is GPU accelerated and does not have a CPU-only option. Parallelism over multiple GPUs generally provides a near-linear performance increase unless there are other bottlenecks such as the filesystem. Performance will depend on a number of factors:
- For EER data, the EER Upsampling Factor parameter (default 2x) causes the data to be processed in 2x super-resolution. We set this default in order to be conservative and ensure that the available signal in the EER files can be made use of downstream. However, in some cases a 2x upsampling factor may not be required and can be set to 1x instead. This is especially the case if micrographs will be downsampled via Output F-crop factor anyway. Changing to 1x reduces the number of pixels processed during motion correction by a factor of 4 (see the quick pixel-count sketch after this list).
- Filesystem performance can have a large impact. You can see the relative times of loading vs. processing data in the output of Patch Motion jobs, e.g.:
-- 0.0: processing 2 of 20: J1/imported/14sep05c_00024sq_00004hl_00002es.frames.tif
loading /bulk6/data/cryosparcdev_projects/P103/J1/imported/14sep05c_00024sq_00004hl_00002es.frames.tif
Loading raw movie data from J1/imported/14sep05c_00024sq_00004hl_00002es.frames.tif ...
Done in 2.26s
Loading gain data from J1/imported/norm-amibox05-0.mrc ...
Done in 0.00s
Processing ...
Done in 6.99s
Completed rigid and patch motion with (Z:5,Y:8,X:8) knots
Writing non-dose-weighted result to J2/motioncorrected/14sep05c_00024sq_00004hl_00002es.frames_patch_aligned.mrc ...
Done in 0.39s
Writing 120x120 micrograph thumbnail to J2/thumbnails/14sep05c_00024sq_00004hl_00002es.frames_thumb_@1x.png ...
Done in 0.31s
Writing 240x240 micrograph thumbnail to J2/thumbnails/14sep05c_00024sq_00004hl_00002es.frames_thumb_@2x.png ...
Done in 0.00s
Writing dose-weighted result to J2/motioncorrected/14sep05c_00024sq_00004hl_00002es.frames_patch_aligned_doseweighted.mrc ...
Done in 0.38s
Writing background estimate to J2/motioncorrected/14sep05c_00024sq_00004hl_00002es.frames_background.mrc ...
Done in 0.01s
Writing motion estimates...
Done in 0.00s
- Different GPU models will have different performance. Typically, GPU memory bandwidth has the most significant impact on performance.
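As a quick illustration of why the upsampling factor matters, here is a rough pixel-count comparison (it assumes a 4096x4096 physical sensor; adjust to your detector geometry):

# Rough pixel-count comparison for EER rendering (assumes a 4096x4096
# physical sensor; numbers are illustrative only).
phys=4096
echo $(( phys * phys ))                 # 16777216 pixels per frame at 1x
echo $(( (2*phys) * (2*phys) ))         # 67108864 pixels at 2x, i.e. 4x the work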
Hope this helps!
I have a similar problem with speed.
I tested running Patch Motion with 1, 2 and 4 GPUs, with puzzling results:
1.5 min for 5 movies with 1 GPU, more than 5 min with 2 GPUs, and the experiment was cancelled with 4 GPUs as it was even longer.
I have A6000 GPUs on a very solid system with a ton of RAM, but something seems to be jamming it.
I noticed a similar behavior with Patch CTF, and also with NU-refine.
If I run only one NU-refine job with ~20k particles, it takes ~2min to run.
As soon as I run a second NU-refine job at the same time on a different GPU, the run time jumps to 10 min. With more than 2 jobs, it takes hours. I am running CryoSPARC v4.6.0.
@adesgeorges Please can you post the output of
- this command on the CryoSPARC master computer
cryosparcm cli "get_scheduler_targets()"
- this command on the GPU server
cat /sys/kernel/mm/transparent_hugepage/enabled
free -h
cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/ssd/cryosparc_cache/', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 1, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 2, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 3, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 4, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 5, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 6, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}, {'id': 7, 'mem': 51041271808, 'name': 'NVIDIA RTX A6000'}], 'hostname': 'headnode', 'lane': 'default', 'monitor_port': None, 'name': 'headnode', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103], 'GPU': [0, 1, 2, 3, 4, 5, 6, 7], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]}, 'ssh_str': 'cryosparc_user@headnode', 'title': 'Worker node headnode', 'type': 'node', 'worker_bin_path': '/spshared/apps/cryosparc4/cryosparc_worker/bin/cryosparcw'},
{'cache_path': '/ssd/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'headnode_slurm', 'lane': 'headnode_slurm', 'name': 'headnode_slurm', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --partition=A6000\n#SBATCH --output={{ job_log_path_abs }}\n#SBATCH --error={{ job_log_path_abs }}\n{%- if num_gpu == 0 %}\n#SBATCH --ntasks={{ num_cpu }}\n#SBATCH --cpus-per-task=1\n#SBATCH --threads-per-core=1\n{%- else %}\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node={{ num_cpu }}\n#SBATCH --cpus-per-task=1\n#SBATCH --threads-per-core=1\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --gres-flags=enforce-binding\n{%- endif %}\n\navailable_devs=""\nfor devidx in $(seq 0 7);\ndo\n if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z "$available_devs" ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'headnode_slurm', 'tpl_vars': ['num_gpu', 'num_cpu', 'project_uid', 'job_log_path_abs', 'cluster_job_id', 'command', 'job_uid', 'run_cmd'], 'type': 'cluster', 'worker_bin_path': '/spshared/apps/cryosparc4/cryosparc_worker/bin/cryosparcw'}]
cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
This setting is problematic. To confirm its impact, you can test whether multi-GPU or multi-job performance improves when you disable transparent_hugepage with the command
sudo sh -c "echo never > /sys/kernel/mm/transparent_hugepage/enabled"
This command disables transparent_hugepage only temporarily (until the next reboot). See here for how to change the setting permanently.
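To make the change persist across reboots, one common approach (an example only, not necessarily the method in the link above; adapt to your distribution) is to set the policy on the kernel command line:

# Example only: persist the THP policy via the kernel command line.
# Add transparent_hugepage=never to GRUB_CMDLINE_LINUX in /etc/default/grub,
# e.g. GRUB_CMDLINE_LINUX="... transparent_hugepage=never"
# then regenerate the GRUB config and reboot:
sudo grub2-mkconfig -o /boot/grub2/grub.cfg   # RHEL-family; use `sudo update-grub` on Debian/Ubuntu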
CryoSPARC v4.6.2 includes a workaround that is effective when cat /sys/kernel/mm/transparent_hugepage/enabled shows always [madvise] never.
May I ask: in what situations might having this parameter set like this be detrimental? And why would it be enabled by default if it is so damaging to performance?
Because it’s not always damaging to performance; for some workloads it can significantly improve performance and memory handling. When a CPU accesses memory, it consults the Translation Lookaside Buffer (TLB), a cache stored on the CPU, to find where it needs to look, and a “miss” in that cache can have a significant impact because the CPU then needs to go to main system RAM (which is much slower than on-die CPU cache). Reducing the number of misses improves performance.
I don’t know how much of a rabbit hole you want to go down, but some things (like the Redis database, and CryoSPARC itself) have severe performance regressions with THP enabled. In fact, databases in general seem to see a significant loss of performance, which, if I understand it correctly, is likely because databases rarely read memory contiguously but access it in a much more stochastic fashion. I went down this rabbit hole myself a few months ago when it became obvious that THP was a pain point for CryoSPARC; I wasn’t sure whether it was database related or whether cryo-EM data processing in general was impacted, but I haven’t seen RELION act as sensitively as CryoSPARC to whether THP is enabled or not.
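If you want to check whether THP is actually in use on a given box, the standard kernel interfaces give a quick read (nothing CryoSPARC-specific here):

# Current THP policy, and how much anonymous memory is currently backed
# by transparent huge pages (both are standard kernel interfaces).
cat /sys/kernel/mm/transparent_hugepage/enabled
grep AnonHugePages /proc/meminfo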
This site has a reasonable rundown if you just want a one page read to sate curiosity.
Hey @adesgeorges,
Is this improvement just from updating to 4.6.2, or did you also disable transparent_hugepage?
Thanks
I did not update, only disabled transparent_hugepage.