Update: I have been using a script to clear the cache every hour, and it seems to have gotten around the "… .sock refused connection" issue. However, I am still having issues, now during 2D classification:
Traceback (most recent call last):
  File "/home/bell/programs/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2192, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 134, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 135, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/jobs/class2D/newrun.py", line 632, in cryosparc_master.cryosparc_compute.jobs.class2D.newrun.class2D_engine_run.work
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 1619, in cryosparc_master.cryosparc_compute.engine.newengine.EngineThread.find_best_pose_shift_class
  File "<array_function internals>", line 5, in unravel_index
ValueError: index -1089082060 is out of bounds for array with size 336
We have run the same 2D classification job (not cloned, same parameters) and get this error at different points during the iterations.
Along with that error, we also get unresponsive-heartbeat terminations for some of the same 2D classification jobs. Any advice would be greatly appreciated.
You may want to exit the shell after having recorded the commands’ outputs to avoid inadvertently running general commands inside the CryoSPARC environment.
bell@ub22-04:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
bell@ub22-04:~$ sudo dmesg -T | grep -i error
(no output)
bell@ub22-04:~$ eval $(cryosparcm env)
(no output)
bell@ub22-04:~$ host $CRYOSPARC_MASTER_HOSTNAME
ub22-04 has address 10.69.108.35
bell@ub22-04:~$ time curl ${CRYOSPARC_MASTER_HOSTNAME}:$CRYOSPARC_COMMAND_CORE_PORT
Hello World from cryosparc command core.
real 0m0.023s
user 0m0.005s
sys 0m0.005s
I am not sure if this is the correct way to do so, but the command I have been clearing the cache with is: sync; echo 1 > /proc/sys/vm/drop_caches
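For context, the hourly setup looks roughly like the sketch below; the script name and install path are illustrative, not the exact files on our system, and dropping caches requires root:

```bash
#!/bin/bash
# drop_page_cache.sh -- hypothetical name for the hourly cache-drop script (run as root).
# Flush dirty pages to disk, then ask the kernel to drop the clean page cache.
sync
echo 1 > /proc/sys/vm/drop_caches
```

It is scheduled from root's crontab with an entry such as `0 * * * * /usr/local/sbin/drop_page_cache.sh` (path illustrative).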
I did run a check-for-corrupt-particles job (with the NaN check enabled) on the same particle stack and no corruption was detected. We have also been turning off SSD caching for every applicable job.
If it helps, 2D classification runs fine with a fraction of the particles (about 1 million), but the full dataset (about 13 million) results in the issues mentioned above.
Interesting. I wonder whether each roughly 1 M-particle partition of the 13 M set would succeed individually, or whether at least one partition would also fail. Have you tried that?
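While the full 13 M-particle job is running, it may also be worth logging overall memory use, in case the failures coincide with memory exhaustion. A minimal sketch, with an arbitrary interval and log path:

```bash
# Append a timestamped memory snapshot every 30 seconds; stop with Ctrl-C.
while true; do
    { echo "=== $(date) ==="; free -h; } >> "$HOME/mem_during_class2D.log"
    sleep 30
done
```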
I tried splitting the particle stack into smaller groups and running 2D classification on each individually, but unfortunately we were met with buffering of the CryoSPARC session. I assigned three 2D classification jobs (2.5 M particles each), each with 1 GPU, and ultimately received the sock connection error:
cryosparcm status
----------------------------------------------------------------------------
CryoSPARC System master node installed at
/home/bell/programs/cryosparc_master
Current cryoSPARC version: v4.4.1
----------------------------------------------------------------------------
CryoSPARC process status:
unix:///tmp/cryosparc-supervisor-2bd2e4ee751475e1d6470e25365ba9c5.sock refused connection
----------------------------------------------------------------------------
So it seems I have not fully addressed this issue quite yet. Any suggestions on what I can try next?
The empty ps -e output suggests that CryoSPARC-related processes have exited, but that socket files were (unexpectedly) left behind.
What is the output of the command cryosparcm log supervisord | tail -n 20?
This is while CryoSPARC is buffering and the sock connection issues are happening:
bell@ub22-04:~$ cryosparcm log supervisord | tail -n 20
2024-04-15 10:11:08,026 INFO spawned: 'app' with pid 62991
2024-04-15 10:11:09,695 INFO success: app entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 10:11:09,855 INFO spawned: 'app_api' with pid 63009
2024-04-15 10:11:11,206 INFO success: app_api entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 14:17:02,685 INFO RPC interface 'supervisor' initialized
2024-04-15 14:17:02,685 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-04-15 14:17:02,687 INFO daemonizing the supervisord process
2024-04-15 14:17:02,687 INFO supervisord started with pid 69353
2024-04-15 14:17:07,480 INFO spawned: 'database' with pid 69467
2024-04-15 14:17:09,355 INFO success: database entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 14:17:11,321 INFO spawned: 'command_core' with pid 69578
2024-04-15 14:17:17,012 INFO success: command_core entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2024-04-15 14:17:17,773 INFO spawned: 'command_vis' with pid 69611
2024-04-15 14:17:18,775 INFO success: command_vis entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 14:17:18,923 INFO spawned: 'command_rtp' with pid 69639
2024-04-15 14:17:19,924 INFO success: command_rtp entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 14:17:23,709 INFO spawned: 'app' with pid 69696
2024-04-15 14:17:25,379 INFO success: app entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-15 14:17:25,539 INFO spawned: 'app_api' with pid 69714
2024-04-15 14:17:26,827 INFO success: app_api entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
@carterwheat Unfortunately, I was not able to confirm the (only) hypothesis I had based on your problem description
and the commands’ outputs that you so patiently provided.
The hypothesis went like this:
1. CryoSPARC was started as normal.
2. CryoSPARC processes were abruptly killed by some event (RAM or other system load?); a mere TERM signal would have allowed for the cleanup of the socket file.
The kernel "OOM killer" seemed to me a good candidate for part 2, but there appear to be no supporting log records. Please let us know if you have any additional information that would point to an alternative cause, for example whether the CryoSPARC processes are running inside a container or are subject to a cluster workload manager.
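If you would like to double-check for OOM killer activity and leftover socket files, commands along these lines may help (the date is taken from the supervisord log above; journalctl assumes systemd, and note that OOM messages do not contain the word "error", so the earlier dmesg grep would not have caught them):

```bash
# Search the kernel ring buffer for OOM killer events
sudo dmesg -T | grep -iE 'out of memory|oom-killer|killed process'

# Search persistent kernel logs from the day the jobs failed (systemd-journald assumed)
sudo journalctl -k --since "2024-04-15" | grep -iE 'out of memory|oom'

# List any CryoSPARC supervisor socket files left behind in /tmp
ls -l /tmp/cryosparc-supervisor-*.sock
```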