CryoSPARC crashing - sock file

Hi,

Yes, it’s like it disconnects and starts “buffering”, but cannot reconnect. If I refresh the page it says “Unable to connect”.


I’ve pasted the output below. (Sorry, I’m a bit inexperienced so not sure I did this properly)

(base) mflab@nextron-Super-Server:~$ eval $(/media/datastore/cryosparc/cryosparc_worker/bin/cryosparcw env)
env | grep PATH
which nvcc
nvcc --version
python -c "import pycuda.driver; print(pycuda.driver.get_version())"
uname -a
free -g
nvidia-smi
CRYOSPARC_PATH=/media/datastore/cryosparc/cryosparc_worker/bin
WINDOWPATH=2
PYTHONPATH=/media/datastore/cryosparc/cryosparc_worker
CRYOSPARC_CUDA_PATH=/usr/local/cuda
LD_LIBRARY_PATH=/usr/local/cuda/lib64:/media/datastore/cryosparc/cryosparc_worker/deps/external/cudnn/lib
PATH=/usr/local/cuda/bin:/media/datastore/cryosparc/cryosparc_worker/bin:/media/datastore/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/bin:/media/datastore/cryosparc/cryosparc_worker/deps/anaconda/condabin:/media/datastore/cryosparc/cryosparc_master/bin:/home/mflab/miniconda3/bin:/home/mflab/miniconda3/condabin:/home/mflab/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
(11, 7, 0)
Linux nextron-Super-Server 5.15.0-58-generic #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
               total        used        free      shared  buff/cache   available
Mem:             125          11          91           0          21         111
Swap:              1           0           1
Fri Feb 10 09:26:42 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    On   | 00000000:01:00.0  On |                  Off |
| 30%   34C    P8    22W / 230W |    468MiB / 24564MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    On   | 00000000:02:00.0 Off |                  Off |
| 30%   31C    P8    13W / 230W |      6MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2128      G   /usr/lib/xorg/Xorg                210MiB |
|    0   N/A  N/A      2261      G   /usr/bin/gnome-shell               78MiB |
|    0   N/A  N/A     10360      G   ...7/usr/lib/firefox/firefox      151MiB |
|    0   N/A  N/A     13376      G   ...mviewer/tv_bin/TeamViewer       23MiB |
|    1   N/A  N/A      2128      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
(base) mflab@nextron-Super-Server:~$ 

I was able to run a Patch motion correction job with both GPUs without a problem, but if I run two different NU-refinement jobs simultaneously, it disconnects.

Thank you

For an NU-refinement job that completed when run alone, but would have failed had it run concurrently with another job, what is the output of the following command (run inside the icli, with your actual project and job identifiers)?

project, job = 'P147', 'J96'
max([e.get('cpumem_mb', 0) for e in db.events.find({'project_uid':project, 'job_uid':job})])

Sure, here’s the output:

(base) mflab@nextron-Super-Server:~$ cryosparcm icli
Python 3.8.15 | packaged by conda-forge | (default, Nov 22 2022, 08:49:35) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.33.0 -- An enhanced Interactive Python. Type '?' for help.

 connecting to nextron-Super-Server:39002 ...
 cli, rtp, db, gfs and tools ready to use

In [1]: project, job = 'P8', 'J101'
   ...: max([e.get('cpumem_mb', 0) for e in db.events.find({'project_uid':projec
   ...: t, 'job_uid':job})])
Out[1]: 42558.15625

Non-uniform refinement jobs use a lot of system RAM. Two concurrent, memory-intensive jobs could cause available system RAM to be exhausted and overall system performance to deteriorate.
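As a rough sanity check (a sketch, not CryoSPARC functionality — the 42558 MB figure is the peak cpumem_mb reported above, and the headroom test ignores any safety margin), you can compare a job's peak RAM use against what the kernel currently reports as available:

```python
def available_mb():
    """Return MemAvailable from /proc/meminfo, in MB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) // 1024  # value is reported in kB
    raise RuntimeError("MemAvailable not found")

peak_job_mb = 42558   # peak cpumem_mb of the NU-refinement job above
n_concurrent = 2      # two NU-refinement jobs at once

if n_concurrent * peak_job_mb > available_mb():
    print("these jobs will likely exhaust system RAM if run concurrently")
else:
    print("the jobs should fit, with no safety margin")
```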

Sorry, but I accidentally removed the sock file while CryoSPARC was running. Now CryoSPARC cannot restart, as described in Error starting cryosparc: "Could not get database status" - #9 by wtempel. How can I solve this problem? I really regret not stopping CryoSPARC first. Thank you very much!

@XianniZhong You may want to try
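A commonly suggested recovery sequence in this situation (a sketch — the socket paths are examples; verify them against your own /tmp listing, and only remove them after confirming no CryoSPARC processes are still running) looks like:

```shell
# 1. Confirm no CryoSPARC or MongoDB processes remain (expect no output)
ps -eo pid,cmd | grep -e cryosparc_ -e mongod | grep -v grep || echo "no cryosparc processes"

# 2. Only if nothing is running, remove the stale socket files
#    (example paths; check what actually exists in /tmp)
rm -f /tmp/cryosparc-supervisor-*.sock /tmp/mongodb-*.sock

# 3. Restart CryoSPARC (only meaningful on a machine with CryoSPARC installed)
command -v cryosparcm > /dev/null && cryosparcm start || echo "cryosparcm not on PATH"
```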

Hi,
Did anybody happen to come up with a fix for this? We are having the same issue, where the CryoSPARC page buffers, and the “cryosparcm status” command gives us a ‘… .sock refused connection’ error. There are no running processes when we check, and removing the sock file allows us to restart CryoSPARC and continue submitting jobs. We have had this issue with both patch motion correction and blob picker. Our workstation has 4 RTX 2080 Ti GPUs; could it be a memory issue?
Thanks!

Welcome to the forum @carterwheat.
Please can you provide additional details:

  1. What was the full command you used to check for running processes?
  2. Is this a single workstation-type (master and worker combined on a single computer) CryoSPARC instance?
  3. What are the outputs of these commands on the CryoSPARC master computer:
    free -g
    sudo dmesg | grep -i oom
    

Hi, Thanks for helping.

The command I’ve been using to check for running processes:
ps -ax | grep cryosparc

Master and worker are on the same, single workstation.

bell@ub22-04:~$ free -g
               total        used        free      shared  buff/cache   available
Mem:             187           4          15           0         167         181
Swap:              1           0           1
bell@ub22-04:~$ sudo dmesg | grep -i oom
bell@ub22-04:~$

The second command doesn’t seem to give me any output.

Hi @wtempel ,

Update: I have been using a script to clear the cache every hour and it seems to have gotten around the “… .sock refused connection” issue. However, I am still having issues, now during 2D classification:

Traceback (most recent call last):
  File "/home/bell/programs/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2192, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 134, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 135, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/jobs/class2D/newrun.py", line 632, in cryosparc_master.cryosparc_compute.jobs.class2D.newrun.class2D_engine_run.work
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 1619, in cryosparc_master.cryosparc_compute.engine.newengine.EngineThread.find_best_pose_shift_class
  File "<__array_function__ internals>", line 5, in unravel_index
ValueError: index -1089082060 is out of bounds for array with size 336

We have run the same 2D classification job (same parameters, not cloned) and get this error at different points in the iterations.

Along with that error, we also get an unresponsive-heartbeat termination for some of the same 2D classification jobs. Any advice would be greatly appreciated.

Thanks!

Thanks for the update @carterwheat.

What cache is being cleared and what is the command being used?
Regarding the new errors you observed:

  1. What is your version of CryoSPARC?
  2. What are the outputs of the following commands in a fresh shell?
    cat /sys/kernel/mm/transparent_hugepage/enabled
    sudo dmesg -T | grep -i error
    eval $(cryosparcm env)
    host $CRYOSPARC_MASTER_HOSTNAME
    time curl ${CRYOSPARC_MASTER_HOSTNAME}:$CRYOSPARC_COMMAND_CORE_PORT
    
    You may want to exit the shell after having recorded the commands’ outputs to avoid inadvertently running general commands inside the CryoSPARC environment.

Hi @wtempel ,

I am running v4.4.1.

In a fresh shell:

bell@ub22-04:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
bell@ub22-04:~$ sudo dmesg -T | grep -i error

(no output)

bell@ub22-04:~$ eval $(cryosparcm env)

(no output)

bell@ub22-04:~$ host $CRYOSPARC_MASTER_HOSTNAME
ub22-04 has address 10.69.108.35
bell@ub22-04:~$ time curl ${CRYOSPARC_MASTER_HOSTNAME}:$CRYOSPARC_COMMAND_CORE_PORT

Hello World from cryosparc command core.

real	0m0.023s
user	0m0.005s
sys	0m0.005s

Not sure if this is the correct way to do so, but this is the command I’ve been clearing the cache with:
sync; echo 1 > /proc/sys/vm/drop_caches
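For reference, the hourly clearing could be wired up with a root cron entry like the one below (a hypothetical /etc/cron.d file; note that drop_caches only releases reclaimable page cache and can hurt I/O performance, so this is a workaround rather than a fix):

```
# /etc/cron.d/drop-caches  (hypothetical file name)
# Every hour, flush dirty pages and drop the page cache, as root
0 * * * * root sync; echo 1 > /proc/sys/vm/drop_caches
```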

Regarding

You may want to check for corrupt particles, as suggested in 2D Classification: ValueError: index is out of bounds for array - #23 by hgxy15. You can use the Check for Corrupt Particles job type and enable Check for NaN values. If the test passes and you had enabled Cache particle images on SSD for classification, the particles may have been corrupted in the cache.
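Outside CryoSPARC, the idea behind that NaN check can be sketched in plain Python (an illustration only — it assumes a raw little-endian float32 stack after a fixed 1024-byte header, whereas real MRC files have several data modes, so use the built-in job type for actual validation):

```python
import math
import struct

def has_nan(path, header_bytes=1024):
    """Scan a raw float32 stack for NaN values, skipping a fixed-size header."""
    with open(path, "rb") as f:
        f.seek(header_bytes)
        while True:
            chunk = f.read(4 * 4096)
            if not chunk:
                return False
            n = len(chunk) // 4  # whole float32 values in this chunk
            for v in struct.unpack(f"<{n}f", chunk[: 4 * n]):
                if math.isnan(v):
                    return True
```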

Thanks for your reply @wtempel

I did run a Check for Corrupt Particles job (with the NaN option) on the same particle stack and no corruption was detected. We have also been turning off SSD caching for every applicable job.

If it helps, 2D classification runs fine with a fraction of the particles (about 1 million), but the full dataset (about 13 million) results in the aforementioned issues.

Interesting. I wonder if each 1M or so partition of the 13M set would succeed individually, or if there were at least one partition that would also fail. Have you tried that?
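The partition bookkeeping for such a test can be sketched generically (CryoSPARC's Particle Sets Tool can do the actual splitting; the ranges below are just illustrative):

```python
def partitions(n_items, chunk_size):
    """Yield (start, stop) index ranges covering n_items in chunks."""
    for start in range(0, n_items, chunk_size):
        yield start, min(start + chunk_size, n_items)

# ~13M particles in ~1M chunks -> 13 separate 2D classification inputs,
# each run individually to see whether one particular partition fails
ranges = list(partitions(13_000_000, 1_000_000))
```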

Hi @wtempel ,

I tried splitting the particle stack into smaller groups to run 2D classification individually, but unfortunately we were again met with buffering of the CryoSPARC session. I ran three 2D classification jobs (2.5 M particles each), each on one GPU, and ultimately received the sock connection error:

cryosparcm status
----------------------------------------------------------------------------
CryoSPARC System master node installed at
/home/bell/programs/cryosparc_master
Current cryoSPARC version: v4.4.1
----------------------------------------------------------------------------

CryoSPARC process status:

unix:///tmp/cryosparc-supervisor-2bd2e4ee751475e1d6470e25365ba9c5.sock refused connection

----------------------------------------------------------------------------

So it seems I have not fully addressed this issue quite yet. Any suggestions on what I can try next?

Thanks again!

What are the outputs of these commands when the sock file refuses connection?

free -g
date
ps -eo user,pid,ppid,start,rsz,vsz,cmd | grep -e cryosparc_ -e mongo | grep -v grep
date
ls -l /tmp/cryosparc*.sock /tmp/mongodb-*.sock

Thanks for your reply @wtempel

Here is the requested output

bell@ub22-04:~/useful-scripts$ date
Mon Apr 15 05:07:15 PM EDT 2024
bell@ub22-04:~/useful-scripts$ ps -eo user,pid,ppid,start,rsz,vsz,cmd | grep -e cryosparc_ -e mongo | grep -v grep
bell@ub22-04:~/useful-scripts$ date
Mon Apr 15 05:07:50 PM EDT 2024
bell@ub22-04:~/useful-scripts$ ls -l /tmp/cryosparc*.sock /tmp/mongodb-*.sock
srwx------ 1 bell bell 0 Apr 15 14:17 /tmp/cryosparc-supervisor-2bd2e4ee751475e1d6470e25365ba9c5.sock
srwx------ 1 bell bell 0 Apr 15 14:17 /tmp/mongodb-39001.sock
bell@ub22-04:~/useful-scripts$

The empty ps output suggests that CryoSPARC-related processes have exited, but socket files were (unexpectedly) left behind.
What is the output of the command

cryosparcm log supervisord | tail -n 20

?

@wtempel here is the output.

This is while CryoSPARC is buffering and the sock connection issues are happening:

bell@ub22-04:~$ cryosparcm log supervisord | tail -n 20
2024-04-15 10:11:08,026 INFO spawned: 'app' with pid 62991
2024-04-15 10:11:09,695 INFO success: app entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 10:11:09,855 INFO spawned: 'app_api' with pid 63009
2024-04-15 10:11:11,206 INFO success: app_api entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 14:17:02,685 INFO RPC interface 'supervisor' initialized
2024-04-15 14:17:02,685 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-04-15 14:17:02,687 INFO daemonizing the supervisord process
2024-04-15 14:17:02,687 INFO supervisord started with pid 69353
2024-04-15 14:17:07,480 INFO spawned: 'database' with pid 69467
2024-04-15 14:17:09,355 INFO success: database entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 14:17:11,321 INFO spawned: 'command_core' with pid 69578
2024-04-15 14:17:17,012 INFO success: command_core entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2024-04-15 14:17:17,773 INFO spawned: 'command_vis' with pid 69611
2024-04-15 14:17:18,775 INFO success: command_vis entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 14:17:18,923 INFO spawned: 'command_rtp' with pid 69639
2024-04-15 14:17:19,924 INFO success: command_rtp entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 14:17:23,709 INFO spawned: 'app' with pid 69696
2024-04-15 14:17:25,379 INFO success: app entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 14:17:25,539 INFO spawned: 'app_api' with pid 69714
2024-04-15 14:17:26,827 INFO success: app_api entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)