I am currently experiencing an issue on v4.0.1+221017, installed and running in a cluster environment on a distributed file system (CephFS). Jobs launched via a Slurm submission script terminate prematurely. We have done some troubleshooting and noticed that the job.log contains several instances of the following error:
Connection Refused: Server at http://:/api is not accepting new connections. Is the server running?
Here I have omitted the server name and port, which appear correctly on my system.
After three such independent events, the job is killed due to a lack of heartbeat. The problem seems tied to traffic on the shared file system, as these errors tend not to occur during periods of lower overall utilization. Has anyone encountered something like this, or does anyone have ideas for further troubleshooting?
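For further troubleshooting, one option would be to probe the command API endpoint periodically and record when it refuses connections, so the refusals can be correlated with shared file system load. A minimal sketch, assuming a bash shell on a worker node; the hostname, port, log file name, and 60-second interval are placeholders to adapt:

# Probe the CryoSPARC command API once a minute and log refusals with a timestamp.
# Replace <master_hostname> and <port> with the values from the error message above.
while true; do
    if ! curl -s --max-time 10 "http://<master_hostname>:<port>/api" -o /dev/null; then
        echo "$(date -Is)  API connection refused or timed out" >> api_probe.log
    fi
    sleep 60
done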
To rule out that the problem is linked to certain CryoSPARC job types or versions, can you please confirm that a clone of a previously successful job fails under high "overall utilization" with the same CryoSPARC version and patch level?
Yes, the issue persists both for resubmissions of the same jobs and for new identical runs, and it affects all job types that require prolonged worker activity, such as 2D or 3D classifications, refinements, or motion correction, but not, for example, subset selections. As far as we can tell there seems to be some instability in the communication between the worker and the master, as jobs are killed due to a lack of heartbeat following three independent occurrences of the aforementioned API connection refusal.
HTML returned on the server running the master from "curl 127.0.0.1:<port>":
CryoSPARC
You need to enable JavaScript to run this app.
HTML returned from an allocated worker node while running a job, from "curl <master_server>:<port>":
CryoSPARC
You need to enable JavaScript to run this app.
To clarify, the master is running on a core login node with limited processing capabilities. The workers are independent processing nodes linked using "cryosparcm cluster connect". Jobs are dynamically allocated onto worker nodes using an sbatch submission script. The graphical interface is piped back to a local machine via port-forwarded SSH.
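As an example of the port forwarding just mentioned (the user name, login node, and default base port 39000 are placeholders; adjust to your installation):

# Forward the CryoSPARC web interface from the login node to the local machine.
ssh -N -L 39000:localhost:39000 <user>@<login_node>
# The interface is then reachable locally at http://localhost:39000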
… is the expected output for the port number of the CryoSPARC browser interface.
Could the port number omitted from the error message you posted earlier have been different?
What is the output of the curl commands after incrementing the port number by 2?
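For reference, those commands would look something like the following; the command API typically listens at the base port plus 2, so <port+2> stands for that value on your system:

# On the master node:
curl 127.0.0.1:<port+2>
# From an allocated worker node while a job is running:
curl <master_server>:<port+2>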
We are still struggling to debug this issue further. What exactly is the API part of the master responsible for under normal conditions, and what possible reasons could the master have for refusing a connection? An obvious workaround would be a way to increase the maximum number of allowed connection refusals, but so far we have not found such a setting.
Based on this observation and the expected responses from the command_core port, I hypothesize that the problem is caused by excessive load on either the CryoSPARC master host or the network. A possible intervention would be to significantly increase the allowed heartbeat interval.
This can be done by adding the line export CRYOSPARC_HEARTBEAT_SECONDS=600
to the end of cryosparc_master/config.sh.
The new setting should become active after cryosparcm restart.
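A minimal sketch of those steps, assuming the master installation lives at /path/to/cryosparc_master:

# Append the longer heartbeat interval to the master configuration.
echo 'export CRYOSPARC_HEARTBEAT_SECONDS=600' >> /path/to/cryosparc_master/config.sh
# Restart so the new setting takes effect.
cryosparcm restart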
Thanks for checking this. We have already tried prolonging the heartbeat timeout. This does not fix the issue, as the worker considers the connection to cryosparc command lost after three failed connections. After that, no more heartbeats are sent by the worker, and the master simply waits for its heartbeat timeout to run out.
Example output from cryosparcm joblog PX JX:
========= sending heartbeat
Connection Refused: Server at http://login2.tcblab:39342/api is not accepting new connections. Is the server running?
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
Connection Refused: Server at http://login2.tcblab:39342/api is not accepting new connections. Is the server running?
************* Connection to cryosparc command lost.
This is the final output, and the master eventually terminates the job after the heartbeat timeout.
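For illustration only (this is not CryoSPARC's actual code), the behaviour described above roughly corresponds to a loop like the one below, where the worker gives up after three failed contacts and the master only notices once its own heartbeat timeout expires; the hostname and port are taken from the log excerpt, and the 30-second interval is assumed:

# Simplified sketch of the observed worker behaviour, not actual CryoSPARC code.
failures=0
while [ "$failures" -lt 3 ]; do
    echo "========= sending heartbeat"
    curl -s --max-time 10 "http://login2.tcblab:39342/api" -o /dev/null \
        || failures=$((failures+1))
    sleep 30   # assumed heartbeat interval
done
echo "************* Connection to cryosparc command lost."
# From this point no further heartbeats are sent; the master kills the job
# once its heartbeat timeout elapses.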