HTTP timeout issue in CryoSPARC

Dear all,

We have been encountering this HTTP timeout error for the past three weeks, and it now almost completely blocks our use of CryoSPARC. Here is the error message:

We tried adding “export CRYOSPARC_CLIENT_TIMEOUT=1800” to config.sh in the “cryosparc_master” folder. However, the setting does not seem to take effect after restarting CryoSPARC with “cryosparc restart”; the error message still says “timeout=300”.
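One quick sanity check is to confirm that the export line is present and uncommented in the file that is being edited. The following is a minimal sketch, not part of CryoSPARC: `parse_exports` and the sample text are hypothetical, and the parser handles only simple `export NAME=value` lines (no trailing comments or shell expansion).

```python
import re

# Matches simple lines such as: export CRYOSPARC_CLIENT_TIMEOUT=1800
EXPORT_RE = re.compile(r"^\s*export\s+([A-Za-z_][A-Za-z0-9_]*)=(.*)$")


def parse_exports(config_text):
    """Return {name: value} for every uncommented `export NAME=value` line."""
    exports = {}
    for line in config_text.splitlines():
        match = EXPORT_RE.match(line)
        if match:
            name, value = match.groups()
            # Drop surrounding whitespace and plain ASCII quotes, if any.
            exports[name] = value.strip().strip("\"'")
    return exports


# Hypothetical config.sh content, for illustration only.
sample = (
    "# export CRYOSPARC_CLIENT_TIMEOUT=600\n"
    "export CRYOSPARC_CLIENT_TIMEOUT=1800\n"
    "export CRYOSPARC_HEARTBEAT_SECONDS=180\n"
)
print(parse_exports(sample).get("CRYOSPARC_CLIENT_TIMEOUT"))
```

To check a real file, read it with `open(path).read()` and pass the text to `parse_exports`. The output of `cryosparcm env` remains the authoritative view of what the master actually loads.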

Ports 39000-39010 are open on both the master and the worker nodes. Running curl against port 39002 of the master node from the worker nodes returns “hello world” without issue.
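The same reachability check can be repeated across the whole port range from each worker. This is a minimal sketch using only the Python standard library; `open_ports` is a hypothetical helper, and a successful TCP connection only shows that the port accepts connections, not that the service behind it answers promptly.

```python
import socket


def open_ports(host, ports, timeout=5.0):
    """Return the subset of `ports` on `host` that accept a TCP connection."""
    reachable = []
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or DNS failure.
            with socket.create_connection((host, port), timeout=timeout):
                reachable.append(port)
        except OSError:
            pass
    return reachable


# Example call from a worker node (replace the placeholder host name):
#   open_ports("xxx.edu", range(39000, 39011))
```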

If we run 2D or 3D jobs without the SSD cache, they are more likely to succeed. With the SSD cache, jobs usually fail with the timeout error before any data is transferred. Note that the particle dataset is not large (only 20 GB).
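For scale, here is a rough lower bound on how long copying a 20 GB dataset to the cache would take. The link speeds of 1 and 10 Gbps are illustrative assumptions, and the estimate ignores protocol overhead, file system latency, and contention, so real transfers will be slower.

```python
def min_transfer_seconds(size_gigabytes, link_gigabits_per_s):
    """Lower bound on transfer time at the nominal line rate."""
    size_gigabits = size_gigabytes * 8  # 1 byte = 8 bits
    return size_gigabits / link_gigabits_per_s


for link in (1, 10):  # assumed link speeds in Gbps
    print(f"{link} Gbps: at least {min_transfer_seconds(20, link):.0f} s")
# prints:
# 1 Gbps: at least 160 s
# 10 Gbps: at least 16 s
```

This only bounds the copy itself; it says nothing about why a request to the master would go unanswered.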

We use CryoSPARC version 3.3.2, with one master node and two worker nodes.

Any suggestions would be much appreciated!

Best

May I ask

  1. In those cases, is caching disabled via the Cache particle images on SSD job parameter?
  2. Did the error ever occur when caching was disabled?
  3. Do jobs with caching work some of the time?
  4. Are there any error messages in the joblog of the job for which you posted the screenshot above?
  5. Please can you send me the output of cryosparcm env via a forum direct message?
  6. What is the nominal bandwidth between the worker nodes and the file server that serves the project directory? Is it 1Gbps ethernet? 10Gbps?
  7. Is there any history of NFS disruptions in your network environment? Any hints in the workers’ or NFS server’s system logs?
  1. In those cases, is caching disabled via the Cache particle images on SSD job parameter?
    Yes.
  2. Did the error ever occur when caching was disabled?
    Not entirely sure. At least, these jobs worked fine when I tested them, while the jobs with SSD caching failed.
  3. Do jobs with caching work some of the time?
    Yes.
  4. Are there any error messages in the joblog of the job for which you posted the screenshot above?
    No error message was shown. Here are the last few lines of the job log file:
    ========= sending heartbeat
    ========= sending heartbeat
    ========= sending heartbeat
    ***************************************************************
    Running job J182 of type class_3D
    Allocated Resources : (not shown)
    *** client.py: command (http://xxx.edu:39002/api) did not reply within timeout of 300 seconds, attempt 1 of 3
    *** client.py: command (http://xxx.edu:39002/api) did not reply within timeout of 300 seconds, attempt 2 of 3
    *** client.py: command (http://xxx.edu:39002/api) did not reply within timeout of 300 seconds, attempt 3 of 3
    **** handle exception rc
    set status to failed
  5. Please can you send me the output of cryosparcm env via a forum direct message?
    Will do.
  6. What is the nominal bandwidth between the worker nodes and the file server that serves the project directory? Is it 1Gbps ethernet? 10Gbps?
    1Gbps ethernet
  7. Is there any history of NFS disruptions in your network environment? Any hints in the workers’ or NFS server’s system logs?
    We are using Lustre, but we also tried an NFS file system. How can we get these logs?
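The retry lines quoted under item 4 have a regular shape, so they can be counted across many job logs. This is a minimal sketch: `timeout_attempts` is a hypothetical helper, and the pattern is taken only from the three lines shown above, so other CryoSPARC versions may word the message differently.

```python
import re

# Pattern derived from the retry lines quoted in the job log above.
TIMEOUT_RE = re.compile(
    r"did not reply within timeout of (\d+) seconds, attempt (\d+) of (\d+)"
)


def timeout_attempts(log_text):
    """Return (timeout_seconds, attempt, max_attempts) for each retry line."""
    return [
        tuple(int(group) for group in match.groups())
        for match in TIMEOUT_RE.finditer(log_text)
    ]


log = """\
*** client.py: command (http://xxx.edu:39002/api) did not reply within timeout of 300 seconds, attempt 1 of 3
*** client.py: command (http://xxx.edu:39002/api) did not reply within timeout of 300 seconds, attempt 2 of 3
*** client.py: command (http://xxx.edu:39002/api) did not reply within timeout of 300 seconds, attempt 3 of 3
"""
print(timeout_attempts(log))
# prints [(300, 1, 3), (300, 2, 3), (300, 3, 3)]
```

If the first number stays at 300 after the config change, the job process is not picking up the new timeout value.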

Thanks!

Best

Please can you also define this variable in cryosparc_worker/config.sh.

Do you use Lustre for

  1. project directories?
  2. particle cache?

Yes, we have done that as well. How do we make this config change take effect? We tried restarting CryoSPARC directly, but that did not seem to work.

Lustre for the project directory.
SSD for the particle cache.

Please can you also send me the output of
/path/to/cryosparc_worker/bin/cryosparcw env
via a forum direct message.

@Btz The change in cryosparc_worker/config.sh should be effective for jobs launched after the change. Do new jobs still show timeout=300?
Also:
What is the size of the $CRYOSPARC_DB_PATH directory?
On what type of storage is $CRYOSPARC_DB_PATH?
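If a quick way to measure the directory size is useful, the following sketch sums file sizes under a path. `directory_size_bytes` is a hypothetical helper, and the snippet assumes `CRYOSPARC_DB_PATH` has been exported into the environment of the shell that launches Python; otherwise pass the path directly.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of regular files under `path`, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.islink(file_path):
                continue
            try:
                total += os.path.getsize(file_path)
            except OSError:
                # File disappeared or is unreadable; skip it.
                pass
    return total


# Assumption: the variable is set in this process's environment.
db_path = os.environ.get("CRYOSPARC_DB_PATH")
if db_path:
    print(f"{directory_size_bytes(db_path) / 1e9:.1f} GB")
```

On a large database directory served over a network file system, the walk itself may take a while.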

Yes. Jobs still show timeout=300 after the change in the worker config.sh. This is similar to the heartbeat issue: it keeps reporting “30s no heartbeat” even though we have set the environment variable CRYOSPARC_HEARTBEAT_SECONDS=180 in both the worker and master config.sh files.

The size of the database directory is 64 GB.

The database directory is on an NFS file system.

I am beginning to suspect a yet-to-be-identified network issue, which would have to be investigated locally.
There is also a small chance that reboots of the CryoSPARC master (after cryosparcm stop) and worker(s) may restore proper function.

We stopped CryoSPARC and rebooted the master and worker nodes. Unfortunately, that did not resolve the timeout issue. Do you have any suggestions for what we should test locally on the network? Thanks for all the comments so far!

If similar tasks ran successfully before the problem first appeared, you may want to investigate changes made around that time, for example:

  • software, OS, hardware updates on your computers
  • changes to network infrastructure

As far as we know, we have not made any updates to software, OS, computer hardware, or the network in the past few weeks.

However, we started using CryoSPARC Live a few weeks ago. Since then, we have usually run Live jobs and regular jobs together under the same CryoSPARC installation. Could this lead to communication issues between the worker nodes and the database on the master node?

This should not be a problem unless there is a misconfiguration or the workload overwhelms available resources. A few questions about your CryoSPARC master host:

  • How much RAM and how many CPU cores does it have?
  • Does the host handle additional tasks:
    • as a CryoSPARC worker
    • as a master for an additional CryoSPARC instance
    • non-CryoSPARC workloads
  • Over what kind of network (Ethernet? How many Gbps?) is it connected to the file server that holds the CryoSPARC database?
  • How heavy is the overall load on the file server that serves the database?
  • Have any patches been applied to the CryoSPARC v3.3.2 installation?