Webapp gets really unresponsive under heavy load

Hi cryoSPARC dev team,

we’re facing the following issue:

After some time of usage, the cryoSPARC webapp gets slower and slower. Our current workaround is to just restart the webapp every night with a cronjob.
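The cronjob itself is nothing special, roughly along these lines (path, time and log file are placeholders):

# illustrative crontab line (path, schedule and log file are placeholders)
0 3 * * * /path/to/cryosparc_master/bin/cryosparcm restart app >> /tmp/cryosparc_webapp_restart.log 2>&1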

But today we have a student practical course with 22 students, all running the 20S tutorial in a single project (each student has their own workspace). We are at 300+ finished jobs today alone.

The initial steps were fine, but after the first 2D classifications (to put the progress in context), the webapp got really unresponsive. Restarting the webapp helps, but after about 30 minutes it’s slow again.

Our master is running on a dedicated server with 4 cores, 64 GB RAM and no GPU acceleration. Looking at htop, it is not overloaded, neither in CPU load nor in memory usage. Still, the web interface is really laggy.

Restarting the webapp, as well as restarting the client browser, helps in the short term, so I guess this might be correlated with node.js caching. But this is currently beyond my understanding, as we haven’t done any detailed analysis yet.

Best
Christian

@ctueting What version/patch of CryoSPARC do you run?

Good morning,

It was v4.4, but I installed the latest update and patch yesterday (now running v4.4.1+240110), as there were no running jobs after the practical course.

Were there any recent changes? If there are any metrics I can/should track, I can do that.

Best
Christian

Same thing here, and it has been happening for a while already, across several previous versions. Every click or drag starts taking up to 20 seconds to have an effect (pretty annoying, imagine doing manual picking like that). Nothing noticeable in htop when it starts happening - I mean, lots of free memory and disk space, no heavy jobs running… and I admit I am not a Linux expert, so I don’t know what else to look for. It is already good to know that restarting CS works, because I was restarting the server each time. It happens on both of our servers, each running a different Linux distro on different hardware. It is more serious on the less powerful server.

Do you also observe the loading screen when opening projects/workspaces? This is the most annoying part, waiting seconds for the workspace to show up, especially when working on/supervising multiple projects.

As a suggestion, don’t restart the entire cryoSPARC instance, as this kills all running jobs.

cryosparcm restart app just restarts the interface, while command_core, which handles the running jobs, is untouched. So you don’t lose any progress, and the webapp restart is fast.
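For example, restarting just the interface and then checking that command_core is still up looks like this (cryosparcm status simply lists the state of the individual services):

cryosparcm restart app
cryosparcm status   # command_core should still be listed as started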


All right, I’ll try that next time. Thanks!

@ctueting @carlos What URL are users typing into their address bar to access the CryoSPARC UI?
UI performance issues are expected for URLs of the form
http://<hostname>:<portnumber> unless <hostname> refers to the loopback interface. For example, ssh local port forwarding in combination with a URL like
http://localhost:61000
is expected to work, whereas a URL that combines the http://, as opposed to the https://, protocol with a non-loopback interface, like
http://bigbox.internal:61000
would be problematic.
If you prefer that users access the UI via a non-loopback, https:// URL, you may want to set up a reverse proxy.
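For orientation, a reverse proxy in front of the base port could look roughly like the nginx sketch below; the server name, the certificate paths and the cryosparc_master:39000 upstream are placeholders, and the upgrade headers are there so that any websocket traffic from the UI keeps working:

server {
    listen 443 ssl;
    server_name cryosparc.example.edu;                  # placeholder name
    ssl_certificate     /etc/ssl/certs/cryosparc.pem;   # placeholder paths
    ssl_certificate_key /etc/ssl/private/cryosparc.key;

    location / {
        proxy_pass http://cryosparc_master:39000;       # CryoSPARC base port
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;         # pass websocket upgrades through
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}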
[Updated 2024-01-12]

Hi wtempel, in our case the (few) users type the IP address of the server, like http://XX.XX.XX.XXX:39000.

We ssh only for creating folders and moving files, or for restarting the server.

We are behind a good firewall; connections are only allowed from inside the campus or via VPN through an app.

This access mode may result in performance issues. Please consider a local port forwarding or reverse proxy workaround. For local port forwarding, it is not necessary to ssh into the CryoSPARC master host specifically; I think any_ssh_server with access to the cryosparc_master:39000 port should do. For example, to access the GUI via http://localhost:44444, one could

ssh -L 44444:cryosparc_master:39000 username@any_ssh_server

So the lag issue is simply due to the way of accessing the interface?

Our users also access the interface via http://masternodeip:39000.

We’re in a university network, and our server is only accessible via its IP (no DNS name). Also, the server has no direct connection to the outside, so HTTPS with a certificate is not easy, as it could not be verified (at least to my understanding).

Port forwarding (using ssh), at least to the master node, is not wanted, as this gives unreasonable rights to the users (they could access the cryosparcm command).
Also, a lot of our users are just users and have little to no command-line experience. The direct URL offers extremely low-barrier accessibility.

You don’t need to SSH in as the CryoSPARC user, as long as the port is tunneled, so users could log in as themselves. You could even set up a user specifically for the tunneling SSH connection with no shell (or a custom shell with all commands disallowed); many years ago our NMR facility used that approach for people who needed to use the floating licenses we had for spectra analysis.
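As a rough sketch (user name, host and port are placeholders, and this assumes plain ssh on the client side):

# as root on the master, or on any host that can reach cryosparc_master:39000:
useradd -m -s /usr/sbin/nologin tunnelonly   # no login shell, so no access to cryosparcm
passwd tunnelonly

# on the user's machine: -N sets up the tunnel only and never requests a remote shell or command
ssh -N -L 39000:localhost:39000 tunnelonly@cryosparc_master

and the browser then points at http://localhost:39000. If you want to lock the account down further, sshd_config Match blocks can restrict it to forwarding only.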

A script which runs the SSH command and then points the browser at the right place shouldn’t be too hard to put together if they can’t handle PuTTY…
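Something like this, as a sketch (again, host, port and user are placeholders; xdg-open is the Linux case):

#!/bin/bash
# start the tunnel in the background (-f) without a remote command (-N), then open the UI
ssh -f -N -L 39000:localhost:39000 tunnelonly@cryosparc_master
xdg-open "http://localhost:39000"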

I know that it’s not that hard, but it makes it “unhandy”. In my opinion, one of the big strengths of cryoSPARC is the web interface, which you can easily access as long as you are on the same network as the main node. This is one of the great advantages over e.g. RELION and Scipion.

Even so, adding this extra step for access is still an extra step.

@ctueting Please can you send us the following information to help us confirm the cause of UI lags:

  1. At a time when slow UI performance is observed using the http://masternodeip:39000 access mode, please collect and email us web browser debug information as described here.
  2. Test whether ssh local port forwarding in combination with UI access via http://localhost:localport resolves UI lags. If UI lags are not resolved, please collect and email us web browser debug information as in the previous step.

Good morning,

I already set up an ssh port forwarding account without shell access, and this indeed increases UI speed dramatically.

But in parallel, I can also collect the other data, so you can take a closer look at this.

Best
Christian

Hi @wtempel

I sent the log file via mail.

During preparation, I encountered another error:
I was not able to put more than 9-10 exposures into the cart; this was possible before the latest update.
I needed this because we had 13 EM acquisitions, and to speed analysis up, each was imported, motion corrected and CTF estimated independently. To merge them into a Curate Exposures job, I wanted to put each output into the cart from the workspace, as this is by far the fastest way. The error for this is Uncaught DOMException: The quota has been exceeded. (see the log file sent via mail for more details). Before the latest update I remember putting all 13 exposures into the cart for the template picker.

So I created a job manually with the builder, opened each Patch CTF job manually, and pulled the exposures into the builder. Here I looked at the 9-tile loading screen each time. It was not as slow as during the practical course, but I hope you can learn something from this.
Also, while pulling the jobs into the builder, there was another error in the debug log: Uncaught (in promise) DOMException: Modifications are not allowed for this document, but this didn’t affect the job building/queueing.

Best
Christian

Thanks @ctueting for sending the log file.
Please can you also record an HAR file (the network output mentioned in the guide) while you observe UI delays, and email us a compressed copy of that HAR file.