Error: can't start new thread

bowman · December 25, 2018, 5:18pm

Hello,

I recently set up a cryoSPARC v2 single-workstation install on a node with the following specifications:

CPU: 2x Xeon Gold 6138 (20c each)
RAM: 384 GB
GPU: 8x RTX 2080

However, any time more than four jobs are queued, the fifth job always throws the error in the title ("can't start new thread").

I am able to thread jobs across multiple GPUs, but we just can’t launch more than four jobs at a time. I don’t see anything in the installation guide or here about this, and am wondering if I somehow incorrectly configured the installation.

Thanks,

Charlie

bowman · December 27, 2018, 4:07am

A quick follow up - I am noticing that some of the jobs are throwing the following errors:

Job is unresponsive - no heartbeat received in 30 seconds.

The job then shows as “Failed”, and removes itself from the queue.

However, for some reason the job keeps running in the background after the heartbeat error is thrown and runs to completion, taking up resources without showing in the queue. I believe this is causing the resource allocation problem.

apunjani · January 11, 2019, 5:51pm

Hi @bowman, thanks for reporting this. We have found a bug that causes num_cpus*2 redundant threads to be created for every cryoSPARC worker process, and we will fix that. The heartbeat error is probably a separate problem. Can you see where during a processing run (and in which job types) the heartbeat error occurs?
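
To put the num_cpus*2 figure in numbers, here is a rough sketch. It assumes `num_cpus` is the core count each worker sees: 40 physical cores on the machine described above, or 80 logical cores if hyper-threading is enabled, which the thread does not state. The function name is made up for illustration and is not part of cryoSPARC.

```python
def redundant_threads(num_cpus, num_workers):
    """Estimate the extra threads created by the bug described above.

    Assumption: each worker process creates num_cpus * 2 redundant
    threads, and the workers are independent of each other.
    """
    return num_cpus * 2 * num_workers


# 2x Xeon Gold 6138 = 40 physical cores; 5 jobs running at once.
physical = redundant_threads(num_cpus=40, num_workers=5)

# Same machine if hyper-threading exposes 80 logical cores (assumption).
logical = redundant_threads(num_cpus=80, num_workers=5)

print(physical, logical)
```

Whether a few hundred extra threads is enough to hit a limit depends on the per-user process/thread limits on the node (for example `ulimit -u`), so treat this as an order-of-magnitude check rather than a diagnosis.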

bowman · January 11, 2019, 8:36pm

Thanks for the follow-up, Ali,

We isolated it to a cache issue: when cryoSPARC reads files, the kernel caches them in memory (which is not needed). We now clear the unused cache constantly, and that has stopped the heartbeat errors.
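
The post does not say how the cache was cleared. A minimal sketch of one way to check and drop the Linux page cache follows, assuming a Linux node and root access for the drop step. `/proc/meminfo` and `/proc/sys/vm/drop_caches` are standard kernel interfaces; the helper names are hypothetical.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total)
    are plain counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def cached_fraction(info):
    """Fraction of total RAM currently used by the page cache."""
    return info["Cached"] / info["MemTotal"]


def drop_page_cache():
    """Flush dirty pages, then ask the kernel to free clean page cache.

    Requires root. Writing "1" frees page cache only (not dentries
    or inodes).
    """
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("1\n")


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print(f"page cache fraction: {cached_fraction(parse_meminfo(f.read())):.2f}")
```

Dropping the cache is safe for data but can slow later reads of the same files, so running it "constantly" is a workaround rather than a tuned setting.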