Hi,
I am trying to troubleshoot performance issues on my GPU, and figure out whether they are hardware or software related.
Does anyone (@stephan?) have completion times for the benchmark workflow (https://cryosparc.com/docs/tutorials/extensive-workflow) with or without the “all job types” option?
Cheers
Oli
March 2021 Edit: Updated benchmarks for cryoSPARC 3.1
Hi @olibclarke, here are the results from our tests with cryoSPARC v3.1 on a 4-GPU machine with the T20S subset:
The Extensive Workflow takes ~1 hour with the default settings and ~1 hour 30 minutes with all job types enabled (note that some jobs run in parallel when enough GPUs are available).
Here are some rough average completion times for each job type:
| Job Type | Approximate Run Time (seconds) |
| --- | --- |
| Import Movies | 92 |
| Patch Motion Correction (Multi) | 220 |
| Full Frame Motion Correction (Multi) | 75 |
| Patch CTF Estimation (Multi) | 66 |
| Curate Exposures | 1.1 |
| Blob Picker | 12 |
| Template Picker | 13 |
| Inspect Picks | 12 |
| Extract from Micrographs (CPU) | 39 |
| Extract from Micrographs (GPU) | 43 |
| Local Motion Correction | 180 |
| Select 2D Classes | 7.5 |
| Ab-Initio Reconstruction (1 class) | 450 |
| Ab-Initio Reconstruction (3 class) | 800 |
| Homogeneous Refinement | 1940 |
| Heterogeneous Refinement (3 class) | 3000 |
| Non-Uniform Refinement | 4300 |
| Sharpen | 32 |
| Validation | 94 |
| Global CTF Refinement | 41 |
| Local CTF Refinement | 46 |
| 3D Variability | 560 |
| 3D Variability Display | 140 |
That is exceedingly helpful, thank you @nfrasser!!
It seems like my root dir is very small (18G) and cryoSPARC was filling up /tmp/. Somehow that was causing a slowdown, I think, because when I move /tmp/ off the root partition and symlink it back, I get results comparable to what @nfrasser posted, whereas previously it completely stalled overnight. Don't know if that makes any sense, but fingers crossed the issue is fixed, regardless!
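For reference, the relocation looks roughly like this (just a sketch; /scratch/tmp stands in for whatever larger volume is available, which isn't specified above):
sudo rsync -a /tmp/ /scratch/tmp/   # copy the existing contents to the larger disk
sudo mv /tmp /tmp.old               # keep the old directory around until everything checks out
sudo ln -s /scratch/tmp /tmp        # symlink /tmp back to its usual path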
I am still having issues with this same GPU… the performance benchmarks look fine, but when I run a bunch of jobs sequentially with large particles (512 box size) they get progressively slower. Particularly noticeable with local decomposition and FSC calculation steps which become extremely slow (many hours). The GPUs look fine in terms of temperature and usage, so I’m not sure what the issue could be - could it be the SSD? One NU-refine job only completed the first iteration (8000 particles) overnight…
Hi @olibclarke,
We have seen something like this before a few times, and it's been related to the operating system using system RAM for file caching. In theory, the OS is supposed to use free RAM to cache recently read files and then evict those cached files when a process requests memory. But for some reason, when the system is under load and there are lots of CPU memory allocations going on (e.g. during FSC computation or NU-refinement, as you said), the OS becomes very slow at evicting cached files.
You can see if this is the case using htop. It's also worth checking that the system is not swapping, to be sure.
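If you want a quick command-line check instead, the page cache and swap usage are visible with standard tools (a sketch, nothing cryoSPARC-specific):
free -h      # the buff/cache column shows RAM used for file caching; the Swap row shows swap usage
vmstat 1 5   # the si/so columns show pages swapped in/out per second; sustained non-zero values mean active swapping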
In order to fix this, what we do is add this line to the root user crontab:
* * * * * sync && echo 1 > /proc/sys/vm/drop_caches
(you can edit the root crontab with sudo crontab -e)
This causes the system to drop the file cache every minute. It's a bit of a sledgehammer solution, but it works well every time for us.
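If you prefer not to open an editor, something like this should also work (a sketch; it appends the entry to whatever is already in root's crontab):
(sudo crontab -l 2>/dev/null; echo '* * * * * sync && echo 1 > /proc/sys/vm/drop_caches') | sudo crontab -
sudo crontab -l   # confirm the entry was installed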
Let us know if that helps!
Thank you Ali, I will try that immediately! This is what my htop looks like:
And re swapping:
Hi @olibclarke,
It looks like your system has at some point in the past been swapping (swap is full), but it's the blue part of the RAM indicator that is file cache. The actual used RAM (green) is only about half of the system's total; the rest is cached files. Once you add that crontab line you should see the blue part disappear and everything should run fast again.
Another quick clue is that a lot of CPUs are engaged, but their usage bars are red rather than green. Green means the CPU is actually doing process work; red means it is running system calls (like evicting the file cache as new memory allocations are made). So all the processes are stalled waiting for the OS.
Also, once you turn off file caching, you can do:
swapoff -a && swapon -a
as root. This will empty the swap area (paging back to RAM) and get the system back to fully normal.
Thank you @apunjani, that is extremely helpful!
It is running a lot faster now!
Is there any reason to have swapping enabled if I have enough RAM (256G in this case)?
Here is the log for a run before using @apunjani’s crontab fix:
And here is a run after:
Quite a difference!
Oli
Strangely, the first set of jobs I ran proceeded fine for the first iteration, but then all were marked as failed with a “no heartbeat for 30s” message - but the jobs continued, despite being marked as failed! Not sure if this is related to the modification to crontab, or if there is something else I need to alter?
Cheers
Oli
@apunjani after applying your crontab fix I am repeatedly getting these "job unresponsive, no heartbeat received for 30 seconds" error messages, causing jobs to move to a failed state but continue running.
This would be fine, except that it then messes with GPU allocation - another job in the queue is assigned to the GPU of the "failed" job, and then both jobs fail "for real". Suggestions welcome!
Cheers
Oli
Ok, I "fixed" this by increasing CRYOSPARC_HEARTBEAT_SECONDS, as described here (Job is unresponsive - no heartbeat received in 30 second), but I'm still not sure why this error appeared after modifying the crontab.
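For anyone else hitting this, the change goes in the master config (a sketch; 180 is just an example value and the path assumes a standard install):
# in cryosparc_master/config.sh
export CRYOSPARC_HEARTBEAT_SECONDS=180
# then restart so the new value takes effect
cryosparcm restart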
Actually @apunjani, I am still persistently seeing these "no heartbeat" errors after applying your fix, even in 3.0 - any advice?
Cheers
Oli
We have the same issue. Quite often now. Is there any fix for this?
Hi @david.haselbach,
Did you also add the drop_caches entry to the crontab, and start seeing this after?
Can you also paste the output of the joblog command for this job:
cryosparcm joblog <project_uid> <job_uid>
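For example (P3 and J42 below are placeholders for your actual project and job UIDs):
cryosparcm joblog P3 J42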
No, we haven't tried the drop_caches yet, just the CRYOSPARC_HEARTBEAT_SECONDS change. I will try that and see whether it fixes things.