Our CryoSPARC instance is unresponsive. It is possible to navigate the projects and jobs, but creating new jobs or editing them either does not work or takes a very long time. The problem is somewhat similar to this thread: https://discuss.cryosparc.com/t/webgui-partly-unresponsive-freezing-cryosparc/9110/3
Judging by the logs, the problem seems to be related to a project that only ever had a failed live session. The project, P193, does not show up in the “All projects” (shoebox) listing, but a live session, S1, does show up in the live session listing (lightning bolt). The project folder is still there and contains two folders, S1 and S3.
Upon restart, command_rtp.log shows that the migration of live sessions proceeds until it reaches this problematic project. See below for selected parts of the log. Background worker errors ending in “socket.timeout: timed out” keep repeating.
I think that purging this failed live session (or the entire project) from the database might fix the problem. How can I do that?
The command_core log indicates some other projects that might be problematic, but I suspect that is due to jobs failing when the server was shut down by killing the supervisor process. See below for selected parts of that log.
Regards,
Daniel
From command_rtp.log:
2023-06-21 10:00:20,908 RTP.MAIN start INFO | === STARTED ===
2023-06-21 10:00:20,909 RTP.BG_WORKER background_worker INFO | === STARTED ===
* Serving Flask app "command_rtp" (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
2023-06-21 10:01:21,871 RTP.MAIN migrate_old_sessions_run INFO | Finished migrating P21 S3 in 0.00s
2023-06-21 10:01:21,871 RTP.MAIN migrate_old_sessions_run INFO | Finished migrating P22 S3 in 0.00s
2023-06-21 10:01:21,871 RTP.MAIN migrate_old_sessions_run INFO | Finished migrating P23 S3 in 0.00s
*snip*
2023-06-21 10:01:28,325 RTP.MAIN migrate_old_sessions_run INFO | Finished migrating P192 S3 in 0.02s
2023-06-21 10:01:28,327 RTP.MAIN create_live_session_job INFO | Creating Live Session job for P193 S1
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | RTP Child Monitor Failed
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | Traceback (most recent call last):
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/cryosparc_command/command_rtp/__init__.py", line 113, in background_worker
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | rtp_child_job_monitor()
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/cryosparc_command/commandcommon.py", line 191, in wrapper
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | return func(*args, **kwargs)
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/cryosparc_command/command_rtp/__init__.py", line 2800, in rtp_child_job_monitor
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | new_status = cli.get_job_status(session['project_uid'], job['uid'])
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/command.py", line 104, in func
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | with make_json_request(self, "/api", data=data) as request:
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/contextlib.py", line 113, in __enter__
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | return next(self.gen)
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/command.py", line 165, in make_request
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | with urlopen(request, timeout=client._timeout) as response:
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/urllib/request.py", line 222, in urlopen
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | return opener.open(url, data, timeout)
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/urllib/request.py", line 525, in open
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | response = self._open(req, data)
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/urllib/request.py", line 542, in _open
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | result = self._call_chain(self.handle_open, protocol, protocol +
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/urllib/request.py", line 502, in _call_chain
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | result = func(*args)
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/urllib/request.py", line 1383, in http_open
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | return self.do_open(http.client.HTTPConnection, req)
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/urllib/request.py", line 1358, in do_open
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | r = h.getresponse()
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/http/client.py", line 1348, in getresponse
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | response.begin()
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/http/client.py", line 316, in begin
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | version, status, reason = self._read_status()
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/http/client.py", line 277, in _read_status
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/socket.py", line 669, in readinto
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | return self._sock.recv_into(b)
2023-06-21 10:05:26,024 RTP.BG_WORKER background_worker ERROR | socket.timeout: timed out
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | POST-RESPONSE-THREAD ERROR at dump_all_live_sessions_run
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | Traceback (most recent call last):
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/cryosparc_command/commandcommon.py", line 78, in run
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | self.target(*self.args)
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/cryosparc_command/command_rtp/__init__.py", line 398, in dump_all_live_sessions_run
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | all_projects = cli.list_projects()
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/command.py", line 104, in func
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | with make_json_request(self, "/api", data=data) as request:
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/contextlib.py", line 113, in __enter__
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | return next(self.gen)
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/cryosparc_tools/cryosparc/command.py", line 165, in make_request
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | with urlopen(request, timeout=client._timeout) as response:
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/urllib/request.py", line 222, in urlopen
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | return opener.open(url, data, timeout)
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/urllib/request.py", line 525, in open
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | response = self._open(req, data)
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/urllib/request.py", line 542, in _open
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | result = self._call_chain(self.handle_open, protocol, protocol +
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/urllib/request.py", line 502, in _call_chain
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | result = func(*args)
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/urllib/request.py", line 1383, in http_open
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | return self.do_open(http.client.HTTPConnection, req)
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/urllib/request.py", line 1358, in do_open
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | r = h.getresponse()
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/http/client.py", line 1348, in getresponse
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | response.begin()
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/http/client.py", line 316, in begin
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | version, status, reason = self._read_status()
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/http/client.py", line 277, in _read_status
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | File "/home/cryosparcuser/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/socket.py", line 669, in readinto
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | return self._sock.recv_into(b)
2023-06-21 10:06:21,961 COMMAND.COMMON run ERROR | socket.timeout: timed out
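To locate the stall in a longer log without reading it by hand, the excerpt above can be summarized with a short script. This is only a sketch: it assumes every relevant line follows the layout seen above (timestamp, logger, function, level, `|`, message), and the names `summarize_rtp_log`, `LINE_RE`, `FINISHED_RE` and `CREATING_RE` are made up for illustration and are not part of CryoSPARC.

```python
import re

# Assumed layout, taken from the excerpt above:
# "<date> <time>,<ms> <logger> <function> <LEVEL> | <message>"
LINE_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\s+"
    r"\S+\s+\S+\s+[A-Z]+\s*\|\s?(?P<msg>.*)$"
)
FINISHED_RE = re.compile(r"Finished migrating (P\d+) (S\d+)")
CREATING_RE = re.compile(r"Creating Live Session job for (P\d+) (S\d+)")


def summarize_rtp_log(lines):
    """Report the last migrated session, the last session for which a
    Live Session job creation was logged, and the number of socket timeouts."""
    summary = {"last_finished": None, "last_creating": None, "timeouts": 0}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # e.g. the Flask banner lines, which carry no timestamp
        msg = match.group("msg")
        finished = FINISHED_RE.search(msg)
        creating = CREATING_RE.search(msg)
        if finished:
            summary["last_finished"] = finished.groups()
        elif creating:
            summary["last_creating"] = creating.groups()
        elif "socket.timeout: timed out" in msg:
            summary["timeouts"] += 1
    return summary
```

Calling `summarize_rtp_log(open(path))` on the full log file (with `path` pointing at wherever command_rtp.log lives on the master node) should show whether the last "Creating Live Session job" entry is always the same project and session before the timeouts begin.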
From command_core.log:
2023-06-21 10:00:08,073 COMMAND.MAIN start INFO | === STARTED ===
2023-06-21 10:00:08,073 COMMAND.BG_WORKER background_worker INFO | === STARTED ===
2023-06-21 10:00:08,073 COMMAND.CORE run INFO | === STARTED TASKS WORKER ===
* Serving Flask app "command_core" (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
2023-06-21 10:00:08,681 COMMAND.MAIN startup INFO | Starting CryoSPARC v4.2.1+230427
2023-06-21 10:00:08,682 COMMAND.MAIN startup INFO | platform_node : donatello
2023-06-21 10:00:08,682 COMMAND.MAIN startup INFO | platform_release : 3.10.0-1160.el7.x86_64
2023-06-21 10:00:08,682 COMMAND.MAIN startup INFO | platform_version : #1 SMP Mon Oct 19 16:18:59 UTC 2020
2023-06-21 10:00:08,682 COMMAND.MAIN startup INFO | platform_architecture : x86_64
2023-06-21 10:00:08,682 COMMAND.MAIN startup INFO | physical_cores : 24
2023-06-21 10:00:08,682 COMMAND.MAIN startup INFO | max_cpu_freq : 3600.0
2023-06-21 10:00:08,682 COMMAND.MAIN startup INFO | total_memory : 503.35GB
2023-06-21 10:00:08,682 COMMAND.MAIN startup INFO | available_memory : 480.98GB
2023-06-21 10:00:08,682 COMMAND.MAIN startup INFO | used_memory : 20.71GB
2023-06-21 10:00:08,682 COMMAND.MAIN startup INFO | version : v4.2.1+230427
2023-06-21 10:00:10,191 COMMAND.STARTUP startup INFO | CryoSPARC instance ID: 7e154680-a3bc-4ea8-a8ef-c89b08459db3
2023-06-21 10:00:10,191 COMMAND.SCHEDULER get_gpu_info INFO | UPDATING WORKER GPU INFO
2023-06-21 10:00:10,191 COMMAND.JOBS update_all_job_sizes INFO | UPDATING ALL JOB SIZES IN 10s
2023-06-21 10:00:10,191 COMMAND.DATA export_all_projects INFO | EXPORTING ALL PROJECTS IN 60s...
2023-06-21 10:00:14,089 COMMAND.HEARTBEAT check_heartbeats WARNING | Marking P195 J85 as failed
2023-06-21 10:00:14,090 COMMAND.JOBS set_job_status INFO | Status changed for P195.J85 from running to failed
2023-06-21 10:00:14,159 COMMAND.JOBS app_stats_refresh WARNING | Failed to call stats refresh endpoint for P195 J85: HTTPConnectionPool(host='donatello', port=29440): Max retries exceeded with url: /api/actions/stats/refresh_job (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcd7d5c3340>: Failed to establish a new connection: [Errno 111] Connection refused'))
2023-06-21 10:00:14,194 COMMAND.HEARTBEAT check_heartbeats WARNING | Marking P200 J37 as failed
2023-06-21 10:00:14,195 COMMAND.JOBS set_job_status INFO | Status changed for P200.J37 from waiting to failed
2023-06-21 10:00:14,198 COMMAND.JOBS app_stats_refresh WARNING | Failed to call stats refresh endpoint for P200 J37: HTTPConnectionPool(host='donatello', port=29440): Max retries exceeded with url: /api/actions/stats/refresh_job (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcd7d5fc2e0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2023-06-21 10:00:14,200 COMMAND.HEARTBEAT check_heartbeats WARNING | Marking P199 J105 as failed
2023-06-21 10:00:14,201 COMMAND.JOBS set_job_status INFO | Status changed for P199.J105 from waiting to failed
2023-06-21 10:00:14,204 COMMAND.JOBS app_stats_refresh WARNING | Failed to call stats refresh endpoint for P199 J105: HTTPConnectionPool(host='donatello', port=29440): Max retries exceeded with url: /api/actions/stats/refresh_job (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcd7d388160>: Failed to establish a new connection: [Errno 111] Connection refused'))
2023-06-21 10:00:14,206 COMMAND.HEARTBEAT check_heartbeats WARNING | Marking P199 J110 as failed
2023-06-21 10:00:14,207 COMMAND.JOBS set_job_status INFO | Status changed for P199.J110 from waiting to failed
2023-06-21 10:00:14,210 COMMAND.JOBS app_stats_refresh WARNING | Failed to call stats refresh endpoint for P199 J110: HTTPConnectionPool(host='donatello', port=29440): Max retries exceeded with url: /api/actions/stats/refresh_job (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcd7d2b37f0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2023-06-21 10:00:14,211 COMMAND.HEARTBEAT check_heartbeats WARNING | Marking P194 J151 as failed
2023-06-21 10:00:14,212 COMMAND.JOBS set_job_status INFO | Status changed for P194.J151 from waiting to failed
2023-06-21 10:00:14,215 COMMAND.JOBS app_stats_refresh WARNING | Failed to call stats refresh endpoint for P194 J151: HTTPConnectionPool(host='donatello', port=29440): Max retries exceeded with url: /api/actions/stats/refresh_job (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcd7d391130>: Failed to establish a new connection: [Errno 111] Connection refused'))
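To check whether these failures are limited to jobs that were interrupted by the shutdown, the "Status changed" entries can be grouped per project. The sketch below assumes the message format shown in the excerpt above; `failed_jobs_by_project` and `STATUS_RE` are hypothetical names chosen for illustration, not CryoSPARC functions.

```python
import re
from collections import defaultdict

# Matches e.g. "Status changed for P195.J85 from running to failed"
STATUS_RE = re.compile(r"Status changed for (P\d+)\.(J\d+) from (\w+) to failed")


def failed_jobs_by_project(lines):
    """Map each project UID to the jobs marked as failed in the log,
    together with the status each job had before the change."""
    failed = defaultdict(list)
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            project, job, previous_status = match.groups()
            failed[project].append((job, previous_status))
    return dict(failed)
```

If every listed job was previously `running` or `waiting`, that would be consistent with the jobs having been interrupted by the shutdown, although it does not prove that the projects themselves are intact.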