Connection Refused: Server is not accepting new connections

Dear CryoSPARC users and developers,

I am currently experiencing an issue with CryoSPARC v4.0.1+221017, installed and running in a cluster environment that uses a distributed file system (CephFS). Jobs launched via a Slurm submission script terminate prematurely. We have done some troubleshooting and noticed that job.log contains several instances of the following error:

Connection Refused: Server at http://:/api is not accepting new connections. Is the server running?

Here I have omitted the server name and port number, which are listed correctly on my system.

After three such independent events, the job terminates and is then killed due to a lack of heartbeat. The problem seems tied to traffic on the shared file system, as these errors tend not to occur during times of lower overall utilization. Has anyone encountered something like this, or does anyone have ideas for further troubleshooting?

Best regards,
Victor Tobiasson

Welcome to the forum @VictorTobiasson

To rule out that the problem is linked to certain CryoSPARC job types or versions, could you please confirm that a clone of a successful job fails under high “overall utilization” with the same CryoSPARC version and patch level?

Hi,

Yes, this issue persists for resubmissions of the same jobs, for new identical runs, and for all job types that require prolonged worker activity, such as 2D, 3D, refinements, or motion correction, but not for subset selections, for example. As far as we can tell, there is some instability in the communication between the worker and the master: jobs are killed due to a lack of heartbeat following three independent occurrences of the aforementioned API connection refusal.

V

With said server name and port number, please can you post the output of these commands

  1. On the master:
    curl 127.0.0.1:<port>
    for example: curl 127.0.0.1:39002
  2. On a cluster node:
    curl <server>:<port>
    for example: curl csmaster:39002

Hi,

Thanks for the help.

HTML from the server running the master ("curl 127.0.0.1:<port>"):

CryoSPARC
You need to enable JavaScript to run this app.

HTML from an allocated worker node while running an allocated job ("curl <master_server>:<port>"):

CryoSPARC
You need to enable JavaScript to run this app.

To clarify, the master is running on a core login node with limited processing capabilities. The workers are independent processing nodes linked using ‘cryosparcm cluster connect’. Jobs are dynamically allocated onto worker nodes using an sbatch submission script. Graphics are piped back to a local machine via port-forwarded SSH.

Raw output

HTML from the server running the master

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon.png" />
    <link rel="icon" type="image/png" sizes="32x32" href="/favicon-32x32.png" />
    <link rel="icon" type="image/png" sizes="16x16" href="/favicon-16x16.png" />
    <link rel="manifest" href="/site.webmanifest" crossorigin="use-credentials" />
    <meta name="theme-color" content="#2563eb" />
    <title>CryoSPARC</title>
    <script>
      if (localStorage.getItem('cryosparc_dark') === 'true') {
        document.querySelector('html').classList.add('dark');
      } else {
        document.querySelector('html').classList.remove('dark');
      }
    </script>
    <script type="module" crossorigin src="/assets/index.8c7c23d8.js"></script>
    <link rel="modulepreload" href="/assets/vendor.5a86cc54.js">
    <link rel="stylesheet" href="/assets/index.3af1f6d6.css">
  </head>
  <body>
    <div id="app"></div>
    <noscript>You need to enable JavaScript to run this app.</noscript>
    <script id="state">__CRYOSPARC__={user:null,ddpToken:null,runningVersion:null,jobTypesAvailable:null,lanes:null,targets:null,keycloakAuthEnabled:undefined}</script>
    
  </body>
</html>

HTML from an allocated worker node while running an allocated job

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon.png" />
    <link rel="icon" type="image/png" sizes="32x32" href="/favicon-32x32.png" />
    <link rel="icon" type="image/png" sizes="16x16" href="/favicon-16x16.png" />
    <link rel="manifest" href="/site.webmanifest" crossorigin="use-credentials" />
    <meta name="theme-color" content="#2563eb" />
    <title>CryoSPARC</title>
    <script>
      if (localStorage.getItem('cryosparc_dark') === 'true') {
        document.querySelector('html').classList.add('dark');
      } else {
        document.querySelector('html').classList.remove('dark');
      }
    </script>
    <script type="module" crossorigin src="/assets/index.8c7c23d8.js"></script>
    <link rel="modulepreload" href="/assets/vendor.5a86cc54.js">
    <link rel="stylesheet" href="/assets/index.3af1f6d6.css">
  </head>
  <body>
    <div id="app"></div>
    <noscript>You need to enable JavaScript to run this app.</noscript>
    <script id="state">__CRYOSPARC__={user:null,ddpToken:null,runningVersion:null,jobTypesAvailable:null,lanes:null,targets:null,keycloakAuthEnabled:undefined}</script>
    
  </body>
</html>

This HTML is the expected output for the port number of the CryoSPARC browser interface.
Could the port number omitted from the error message you posted earlier have been different?
What is the output of the curl commands after incrementing the port number by 2?

Thank you for your quick response. The API port quoted in the error is indeed incremented by 2 relative to the “CRYOSPARC_BASE_PORT”.

curl on master with correct port:

Hello World from cryosparc command core.

curl on allocated worker with correct port:

Hello World from cryosparc command core.
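
Since a single curl succeeds but the refusals appear only intermittently, repeated probing of the command core port may help line the failures up with periods of file system load. The following is a minimal sketch, not part of CryoSPARC: the function names are hypothetical, and it only tests whether a TCP connection to the port can be opened, which is a weaker check than a full API request.

```python
import socket
import time
from datetime import datetime


def command_core_port(base_port: int) -> int:
    """Command core API port as described in this thread: base port + 2."""
    return base_port + 2


def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt one TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Covers name resolution failures, unreachable hosts, etc.
        return f"error: {exc}"


def monitor(host: str, port: int, interval: float = 10.0, count: int = 6) -> None:
    """Print one timestamped probe result every `interval` seconds."""
    for _ in range(count):
        stamp = datetime.now().isoformat(timespec="seconds")
        print(f"{stamp} {host}:{port} {probe(host, port)}")
        time.sleep(interval)
```

Run from a worker node, a call such as `monitor("login2.tcblab", command_core_port(39340))` would log one line per probe. The base port 39340 is an assumption here, inferred from the API port 39342 that appears in the job log quoted in this thread; substitute your own CRYOSPARC_BASE_PORT.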

Hi again,

We are still struggling to debug this issue. What exactly is the API part of the master responsible for under normal conditions, and what possible reasons could the master have for refusing a connection? An obvious workaround would be to increase the maximum number of refused connections that the worker tolerates, but so far we have not found such a setting.

Best,
Victor

Please can you email us the error report for a job that has failed in this way.

Based on this observation and the expected responses from the command core port, I hypothesize that the problem is caused by excessive load on either the CryoSPARC master host or the network. A possible intervention would be to significantly increase the allowed heartbeat interval.
This can be done by adding the line
export CRYOSPARC_HEARTBEAT_SECONDS=600
to the end of cryosparc_master/config.sh.
The new setting should become active after cryosparcm restart.
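
To double-check that the line actually ended up in the file before restarting, the exported variables can be listed with a small script. This is only a sketch under simple assumptions: the helper name `read_exports` is hypothetical, it handles plain `export NAME=VALUE` lines only (no inline comments or shell expansion), and it mimics the shell in letting a later assignment override an earlier one.

```python
import re

# Matches lines of the form: export NAME=VALUE
EXPORT_LINE = re.compile(r"^\s*export\s+([A-Za-z_][A-Za-z0-9_]*)=(.*)$")


def read_exports(config_text: str) -> dict:
    """Collect `export NAME=VALUE` assignments from config.sh-style text."""
    values = {}
    for line in config_text.splitlines():
        match = EXPORT_LINE.match(line)
        if match:
            # Drop surrounding whitespace and simple quoting around the value.
            values[match.group(1)] = match.group(2).strip().strip("\"'")
    return values


if __name__ == "__main__":
    # Path taken from the instructions above, relative to the install directory.
    with open("cryosparc_master/config.sh") as handle:
        exports = read_exports(handle.read())
    print(exports.get("CRYOSPARC_HEARTBEAT_SECONDS", "not set"))
```

If the script prints `not set`, the export line is missing or is written in a form this simple pattern does not recognize.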

Hi wtempel,

Thanks for checking this. We have already tried prolonging the heartbeat timeout. This does not fix the issue as the worker considers the connection to cryosparc command lost after 3 failed connections. No more heartbeats will be sent by the worker. The master is waiting for the heartbeat timeout to run out.

Example output from cryosparcm joblog PX JX

========= sending heartbeat
Connection Refused: Server at http://login2.tcblab:39342/api is not accepting new connections. Is the server running?
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
Connection Refused: Server at http://login2.tcblab:39342/api is not accepting new connections. Is the server running?
 ************* Connection to cryosparc command lost.

This is the final output and the master eventually terminates the job after the heartbeat timeout.
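
For anyone comparing many failed jobs, the pattern above can be counted automatically. The sketch below summarizes job log text by matching the three message types shown in the excerpt; the function name `summarize_joblog` is hypothetical, and the patterns are based only on the lines quoted here, so other CryoSPARC versions may word them differently.

```python
import re

# Patterns taken from the job log excerpt quoted in this thread.
REFUSED = re.compile(
    r"^Connection Refused: Server at (\S+) is not accepting new connections"
)
HEARTBEAT = "========= sending heartbeat"
LOST = "Connection to cryosparc command lost"


def summarize_joblog(text: str) -> dict:
    """Count heartbeats and refused connections, and note a lost connection."""
    summary = {"heartbeats": 0, "refused": 0, "servers": set(), "lost": False}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = REFUSED.match(line)
        if match:
            summary["refused"] += 1
            summary["servers"].add(match.group(1))
        elif line.startswith(HEARTBEAT):
            summary["heartbeats"] += 1
        elif LOST in line:
            summary["lost"] = True
    return summary


SAMPLE = """\
========= sending heartbeat
Connection Refused: Server at http://login2.tcblab:39342/api is not accepting new connections. Is the server running?
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
Connection Refused: Server at http://login2.tcblab:39342/api is not accepting new connections. Is the server running?
 ************* Connection to cryosparc command lost.
"""
```

Calling `summarize_joblog(SAMPLE)` on the excerpt reports five heartbeats, two refused connections to a single server, and a lost connection. The same function can be applied to the saved contents of a job.log file.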

Best,
V

CryoSPARC v4.1.0 includes changes to heartbeat checking that may mitigate this problem. As of December 2022, I recommend updating to v4.1.1.