cryoSPARC master unresponsive and high memory load

Hi cryoSPARC Dev Team,

We recently updated our cryoSPARC instance from v3.3.2 to v4.3. Additionally, we updated the OS from CentOS 7 to Ubuntu 22.04. The cryoSPARC installation was on a separate /home partition and was untouched during the OS update.

After the OS update, cryoSPARC was simply updated using the command:

cryosparcm update

Until here, everything ran very smoothly. The new interface loaded, and we can access all the previous jobs.

However, the interface is now extremely slow. Elements inside cryoSPARC (e.g., after clicking on a job) load very slowly. The first very odd thing we noticed was during "Select 2D": whether built from the cart or from scratch, cryoSPARC occasionally asks us to queue the job on a worker, which fails. After waiting some time, we can actually queue the job on the master node.

To identify the cause of the slow interface, I checked the server's load: the CPU load is low, with peaks up to 25% (100% when a job is executing). But the memory is heavily loaded; 50 of 62 GB are in use.

PID 	 SWAP 	 RSS 	 COMMAND
9564 	 15.4G 	 43.9G 	 python -c import cryosparc_command.command_core as serv; serv.start(port=39002)
9463 	 835.5M 	 3.3G 	 mongod --auth --dbpath /home/cryosparc_user/cryosparc/cryosparc_database --port 39001 --oplogSize 64 --replSet meteor --nojournal --wiredTigerCacheSizeGB 4 --bind_ip_all
9657 	 619.0M 	 1.2G 	 /home/cryosparc_user/cryosparc/cryosparc_master/cryosparc_app/api/nodejs/bin/node ./bundle/main.js
9598 	 27.3M 	 155.2M 	 python -c import cryosparc_command.command_rtp as serv; serv.start(port=39005)
9639 	 14.8M 	 103.7M 	 node dist/server/index.js

The command_core process of cryoSPARC is using 43 GB of RAM plus an additional 15 GB of swap. I suspect this is the reason why the interface is so unresponsive.

We currently have the legacy web app active, but stopping it does not reduce memory usage.

Our master node has the following specifications:

  • Intel(R) Xeon(R) CPU E3-1225 v5 @ 3.30GHz (4 cores)
  • 64 GB Memory
  • 741 GB home partition (where cryoSPARC is residing)

Any idea why this command_core process is using that much memory? Currently, our users are running some important jobs, so restarting cryoSPARC is not possible at the moment. As soon as this is possible, I’ll try a simple restart of the cryoSPARC instance.

Best, Christian

Hi @ctueting ,

Thanks for the post and sorry to hear about the issue you’re facing.

To help us debug, please email us the tgz file within the cryosparc_master/run directory produced by the command cryosparcm snaplogs (I will send you the email address by direct message).

Guide: snaplogs command reference

Regards,
Suhail

Hi @ctueting ,

Thanks for sending over the logs. There isn't anything that immediately points to a particular issue regarding high memory usage. It's possible there was some sort of network interruption, as the web application logs report a broken WebSocket connection; this connection is used to display certain information in the interface and could be the cause of the UI slowness you reported.

Our team recommends restarting the instance at your earliest convenience. If the memory issues still persist, please let us know.

After stopping CryoSPARC (cryosparcm stop), please ensure no processes are still running (ps axuww | grep -e cryosparc -e mongo) before restarting (cryosparcm start).
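If it is easier to script that leftover-process check, a minimal Python sketch equivalent to the ps | grep above (Linux only, since it scans /proc; the function name is mine) could look like this:

```python
import os

def find_processes(keywords=("cryosparc", "mongo")):
    """Scan /proc for processes whose command line mentions any keyword (Linux only)."""
    hits = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as f:
                # cmdline arguments are NUL-separated; join them with spaces
                cmd = f.read().replace(b"\x00", b" ").decode(errors="replace")
        except OSError:
            continue  # process exited while we were scanning
        if cmd and any(k in cmd for k in keywords):
            hits.append((int(entry), cmd.strip()))
    return hits

if __name__ == "__main__":
    for pid, cmd in find_processes():
        print(pid, cmd)
```

An empty result after `cryosparcm stop` would mean it is safe to start again.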

- Suhail

Thanks for the reply.

I will restart the instance asap.
I am currently watching the live log (cryosparcm log command_core) and I am observing this:

2023-08-16 17:07:06,569 dump_workspaces      INFO     | Exported all workspaces in P62 to /cryosparc_projects/server1/P62/workspaces.json in 0.08s
2023-08-16 17:07:06,605 dump_workspaces      INFO     | Exporting all workspaces in P62...
2023-08-16 17:07:06,686 dump_workspaces      INFO     | Exported all workspaces in P62 to /cryosparc_projects/server1/P62/workspaces.json in 0.08s
2023-08-16 17:07:17,629 dump_workspaces      INFO     | Exporting all workspaces in P62...
2023-08-16 17:07:17,714 dump_workspaces      INFO     | Exported all workspaces in P62 to /cryosparc_projects/server1/P62/workspaces.json in 0.09s
2023-08-16 17:07:17,749 dump_workspaces      INFO     | Exporting all workspaces in P62...
2023-08-16 17:07:17,825 dump_workspaces      INFO     | Exported all workspaces in P62 to /cryosparc_projects/server1/P62/workspaces.json in 0.08s
2023-08-16 17:07:28,721 dump_workspaces      INFO     | Exporting all workspaces in P62...
2023-08-16 17:07:28,818 dump_workspaces      INFO     | Exported all workspaces in P62 to /cryosparc_projects/server1/P62/workspaces.json in 0.10s
2023-08-16 17:07:28,857 dump_workspaces      INFO     | Exporting all workspaces in P62...
2023-08-16 17:07:28,927 dump_workspaces      INFO     | Exported all workspaces in P62 to /cryosparc_projects/server1/P62/workspaces.json in 0.07s
2023-08-16 17:07:39,917 dump_workspaces      INFO     | Exporting all workspaces in P62...
2023-08-16 17:07:40,002 dump_workspaces      INFO     | Exported all workspaces in P62 to /cryosparc_projects/server1/P62/workspaces.json in 0.08s
2023-08-16 17:07:40,041 dump_workspaces      INFO     | Exporting all workspaces in P62...
2023-08-16 17:07:40,127 dump_workspaces      INFO     | Exported all workspaces in P62 to /cryosparc_projects/server1/P62/workspaces.json in 0.09s
2023-08-16 17:07:50,992 dump_workspaces      INFO     | Exporting all workspaces in P62...
2023-08-16 17:07:51,113 dump_workspaces      INFO     | Exported all workspaces in P62 to /cryosparc_projects/server1/P62/workspaces.json in 0.12s
2023-08-16 17:07:51,149 dump_workspaces      INFO     | Exporting all workspaces in P62...
2023-08-16 17:07:51,219 dump_workspaces      INFO     | Exported all workspaces in P62 to /cryosparc_projects/server1/P62/workspaces.json in 0.07s
2023-08-16 17:08:02,078 dump_workspaces      INFO     | Exporting all workspaces in P62...
2023-08-16 17:08:02,160 dump_workspaces      INFO     | Exported all workspaces in P62 to /cryosparc_projects/server1/P62/workspaces.json in 0.08s
2023-08-16 17:08:02,198 dump_workspaces      INFO     | Exporting all workspaces in P62...
2023-08-16 17:08:02,270 dump_workspaces      INFO     | Exported all workspaces in P62 to /cryosparc_projects/server1/P62/workspaces.json in 0.07s
2023-08-16 17:08:04,392 dump_workspaces      INFO     | Exporting all workspaces in P62...
2023-08-16 17:08:04,494 dump_workspaces      INFO     | Exported all workspaces in P62 to /cryosparc_projects/server1/P62/workspaces.json in 0.10s
2023-08-16 17:08:13,384 dump_workspaces      INFO     | Exporting all workspaces in P62...
2023-08-16 17:08:13,460 dump_workspaces      INFO     | Exported all workspaces in P62 to /cryosparc_projects/server1/P62/workspaces.json in 0.08s
2023-08-16 17:08:13,503 dump_workspaces      INFO     | Exporting all workspaces in P62...
2023-08-16 17:08:13,588 dump_workspaces      INFO     | Exported all workspaces in P62 to /cryosparc_projects/server1/P62/workspaces.json in 0.08s
2023-08-16 17:08:24,486 dump_workspaces      INFO     | Exporting all workspaces in P62...
2023-08-16 17:08:24,560 dump_workspaces      INFO     | Exported all workspaces in P62 to /cryosparc_projects/server1/P62/workspaces.json in 0.07s
2023-08-16 17:08:24,602 dump_workspaces      INFO     | Exporting all workspaces in P62...
2023-08-16 17:08:24,673 dump_workspaces      INFO     | Exported all workspaces in P62 to /cryosparc_projects/server1/P62/workspaces.json in 0.07s
2023-08-16 17:08:35,581 dump_workspaces      INFO     | Exporting all workspaces in P62...
2023-08-16 17:08:35,661 dump_workspaces      INFO     | Exported all workspaces in P62 to /cryosparc_projects/server1/P62/workspaces.json in 0.08s

Nobody is working on that workspace currently.

And the file is altered every few seconds, as seen with watch -n1 'ls -la --time-style=full-iso workspaces.json'. The file size remains the same, but the timestamp changes. So cryoSPARC is writing the same file to the hard drive over and over.
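For the record, the same observation can be scripted instead of eyeballing watch output. A small sketch (function names are mine) that polls the file's mtime and size:

```python
import os
import time

def sample_mtimes(path, interval_s=1.0, samples=5):
    """Poll a file's modification time and size; returns a list of (mtime, size) tuples."""
    observed = []
    for i in range(samples):
        st = os.stat(path)
        observed.append((st.st_mtime, st.st_size))
        if i < samples - 1:
            time.sleep(interval_s)
    return observed

def was_rewritten(observed):
    """True if the mtime changed between any two samples, i.e. the file was rewritten."""
    return len({mtime for mtime, _ in observed}) > 1
```

Running `sample_mtimes("workspaces.json", 1.0, 10)` over the export loop shown in the log above should report changing mtimes with a constant size.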

Might this be related to the high memory load?

Best
Christian

Hi @ctueting,

Do you have a CryoSPARC Live session in that project that is currently running? If it's not needed, pausing it should stop the export actions. But I don't believe this is the cause of the high memory load, as this action doesn't require much memory. Still, it would be good to know whether it is.

1 Like

Wow … there was actually still a live session. It was started 6 months ago and had been running the entire time; it survived downtimes and the upgrade to v4.3. But stopping this job did not free any memory.

I restarted the server, and the memory load is back to normal: around 3 GB of usage.

If we find out how to trigger this, I'll post again, but until then, I think the solution is 'the' gold-standard solution: restarting.

Thanks for your continuous support.

3 Likes

Good to know, we have a timely fix coming out in the next release that will make sure any running CryoSPARC Live sessions are paused when CryoSPARC is restarted.

:grin:

3 Likes

After restarting yesterday evening, we ran some extraction and refinement jobs.
Memory usage is over 50%, and it's again the python command_core process taking the majority of the memory:

PID 	 SWAP 	 RSS 	 COMMAND
1130941 	 870.6M 	 31.7G 	 python -c import cryosparc_command.command_core as serv; serv.start(port=39002)
1130831 	 168.9M 	 1.9G 	 mongod --auth --dbpath /home/cryosparc_user/cryosparc/cryosparc_database --port 39001 --oplogSize 64 --replSet meteor --nojournal --wiredTigerCacheSizeGB 4 --bind_ip_all

If I watch the memory, it's stable: one second it's a bit more, the next a bit less. So there is no constant leakage right now.

Do you want me to do some more analysis or logging of memory? Restarting every 12 hours is not practical.

Could this be related to the installation itself? The master node was installed ~18 months ago, and since then only updates were installed. Now that cryoSPARC v4.3 is running, should I try a truly clean, fresh installation and specify the database path during installation?
Maybe this would fix it, as debugging upgrade artifacts is hard.

Edit: What is also a bit strange is the high PID. As this number is assigned sequentially in Ubuntu, such a high number is unexpected. I reinstalled Ubuntu on Monday, so it's really vanilla, yet after 2.5 days 1.3 million processes have already been started, and only cryoSPARC is running on this machine. The only modification is that I changed the bootloader options from "quiet splash" to "text nomodeset", as the internal Intel graphics caused some issues; Ubuntu was unbootable otherwise. But I am not sure whether this is related to the cryoSPARC issue. If the web UI does not use any GPU acceleration, this should be irrelevant.

Edit2:
Five hours later, we now have 10 GB more memory load and 5.5 GB more swap usage:

PID 	 SWAP 	 RSS 	 COMMAND
1130941 	 6.4G 	 41.9G 	 python -c import cryosparc_command.command_core as serv; serv.start(port=39002)

If you can do this, that would be really helpful. I'll keep analyzing your latest logs until you can provide a memory analysis of the server, but so far the logs aren't telling us why memory is increasing.

The server is mostly tied to the database, so unless you start with a fresh one, I don’t think reinstalling would help. However if we exhaust all our options we can try that.

For certain elements of the UI (CSS transitions, canvas, WebGL plots) the browser will use the GPU. But I would expect to see the app service raising flags.

Hi, I saw your mail and will send you the logs soon.

To start, I can log the memory usage of the cryoSPARC processes every 10 seconds, so one can correlate memory usage with the cryoSPARC logs.
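For reference, a minimal sketch of such a logger (Linux only; it parses the standard VmRSS and VmSwap fields from /proc/&lt;pid&gt;/status, and the function names are mine):

```python
import time
from datetime import datetime

def read_proc_memory(pid):
    """Return (rss_kb, swap_kb) for a PID by parsing /proc/<pid>/status (Linux only)."""
    rss_kb = swap_kb = 0
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                rss_kb = int(line.split()[1])  # the kernel reports the value in kB
            elif line.startswith("VmSwap:"):
                swap_kb = int(line.split()[1])
    return rss_kb, swap_kb

def log_memory(pids, interval_s=10, samples=None, out=print):
    """Emit one tab-separated line per PID per interval: timestamp, pid, rss_kb, swap_kb."""
    n = 0
    while samples is None or n < samples:
        ts = datetime.now().isoformat()
        for pid in pids:
            rss, swap = read_proc_memory(pid)
            out(f"{ts}\t{pid}\t{rss}\t{swap}")
        n += 1
        if samples is None or n < samples:
            time.sleep(interval_s)
```

Pointing it at the command_core PID and redirecting the output to a file would produce exactly the kind of tab-separated log described here.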

If nothing works, we can also start with a completely new database and re-import all jobs. But that should be the last resort.

I'll try to fix this, but I need to be at the machine in case Ubuntu freezes during the restart due to the i915 driver issue, and the server is in the university's server room.

Hi @stephan,

I restarted the cryoSPARC server at the weekend and recorded the memory usage.

Here is the memory usage of the process python -c import cryosparc_command.command_core as serv; serv.start(port=39002)

The timerange is from: 2023-08-20 19:33:42 to 2023-08-21 07:26:18
Unfortunately, the memory usage is not stepwise; there is a constant leak up until this morning, when the swap filled up.

I checked the command_core log, downloaded from the admin panel, and during this time not a single job was started. So this is purely cryoSPARC background behaviour.

I restarted the server again (cryoSPARC, not the actual machine), and the memory usage is currently increasing. Based on the last run (the one before the plotted one), it will end with the swap ~80% full and relatively constant memory usage.

I can send you the recent cryoSPARC logs together with my memory log (~60 MB; tab-separated txt file without header; the timestamps are in German format, unfortunately), so you can correlate them.

Best
Christian

Thanks for the offer. We will reach out via direct message regarding transfer details.

Mail sent.

If you cannot reproduce the issue on your servers, I am happy to send you more data, or to run cryoSPARC with some debugging on our side (e.g., dumping the contents of gc to an additional log file). I can imagine that if this is some memory-leak bug, it is hard to identify.

Hi @ctueting

Going to be asking you questions as I analyze the logs (all my questions are scoped to the time between 2023-08-20 19:33:42 to 2023-08-21 07:26:18)

Did you have any jobs in “launched” status?
Do you submit jobs to a cluster lane at all?

Would you be comfortable sending us a copy of your entire database (it looks like it’s about 350GB)? If so, I can set you up with credentials to scp it to a node in our datacenter.

Another option is to make a local copy of your database, and run a custom version of CryoSPARC I’ll send you that outputs diagnostic information to log files that we can further analyze.

Can you post the output of sudo pmap <PID of command_core process> -x?

This was from Sunday evening to Monday morning, and it's holiday time. I did not check in detail, but as far as I can tell, no job was queued or launched.

No, we don't use the university cluster; we use our own master/worker configuration, where all workers are accessed via SSH. So no cluster environment.

Yes, I am comfortable with this. I assume you will handle the data with care and keep it confidential. I will contact you via private message for the details.

Sure, I'll send the output to feedback[at]structura.bio as it exceeds the character limit of this post.

Best Christian

Hi,
please keep this thread updated with your findings, we might be observing a similar issue:

There was a steady memory increase from 14:30 yesterday until midnight, when the command_core process on the master was killed by the OOM killer. There were only a handful of small interactive jobs on that node during that timeframe. There was also ca. 500-700 Mbit/s of incoming network traffic during that time, which stopped when command_core was killed.
As per the supervisor config, the command_core process was automatically restarted; after that it stayed at a steady resource usage, and the network traffic towards it stopped.

We’re running all workload on a cluster lane.
As of now, the process seems to be sitting steadily around 1.4 GB of memory.

If/when I catch it again at its increasing memory, I can also provide a pmap.
I believe we’ve observed this issue 3 times over the last week, where the memory usage of the master node increases until OOM. We typically have around 10 jobs running and maybe 5 launched/waiting in the cluster queue during a regular day.
Best,
Erich

Edit: As this might be related to the overall situation: at the time the memory increase started (14:30), we had to reboot the instance, as it had become unresponsive. At that time there were cluster jobs running, which were not terminated (neither the Slurm jobs nor the cryoSPARC jobs); we left them running.

Hi @ebirn,

Thanks for posting, this is very helpful!

The network traffic is interesting: are your projects hosted on a storage node that is mounted over NFS?