CS 4.1 job failing abruptly

diffracteD · December 17, 2022, 8:05pm

Hi,
Since my update from 3.2 to 4.1 failed badly, I decided to do a fresh install of 4.1 on a different port on the same standalone machine.
Even after the fresh install, 4.1 is giving a strange error that never occurred with 3.2 on the same standalone server. For example, when I try to run ‘patch motioncor’, it fails without any error message.

Please advise how to fix this, or whether I should roll back to 3.2.
Thanks

olibclarke · December 17, 2022, 8:07pm

We are seeing this sporadically on our CentOS system. Often it will fail this way, but then run correctly on restart. Weird. Happens on non-GPU jobs too (e.g. import volumes etc), so presumably not a CUDA issue.

diffracteD · December 17, 2022, 8:17pm

I’m on Rocky Linux, and restarting does not help either. Jobs that run are mostly refinements, import jobs, and CPU extract. Jobs that fail or do not run are patch motioncor and extract with GPU. That’s all I could test with 4.1.
I’m not sure whether I should roll back to 3.2 given the situation!

jonathanjih · December 19, 2022, 6:19am

We are seeing this same issue across several standalone nodes as well. Updated from v4.0.3 to v4.1.0 on both Scientific Linux 7 and Rocky Linux 8.

Possibly related: the Extensive Workflow benchmark test now also fails spontaneously at random jobs with the same “job process terminated abnormally” error. I usually run a full benchmark on each server after any new update, but so far I have been unable to finish a single run with v4.1.0.

diffracteD · December 19, 2022, 7:27pm

Thanks for confirming. The only solution I found to keep work going was to roll back to 4.0, which is working great.
Hopefully this gets resolved soon. I really wanted to give the flex-refinement a try.
Thanks.

wtempel · December 19, 2022, 8:40pm

We are aware of (and in the process of fixing) an issue that

- we have reproduced on CentOS 7
- fails to print a meaningful error message to job.log
- would have occurred on 4.0.X versions also

You can help us to better define the scope of the problem and to potentially provide a more comprehensive fix by including in your post, if you have not already:

- the error message from the event log (with preceding lines for context)
- messages from job.log
- output of uname -a
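
For anyone gathering the items requested above, here is a minimal sketch that collects two of them into one block of text for pasting into a reply. It is not a CryoSPARC tool: the function names `tail_lines` and `build_report` are hypothetical, and the path to job.log must be supplied by you, since its location depends on your project directory. The event log excerpt still has to be copied separately.

```python
import platform
from pathlib import Path


def tail_lines(path, n=50):
    """Return the last n lines of a text file, or a one-line note if it cannot be read."""
    try:
        text = Path(path).read_text(errors="replace")
    except OSError as exc:
        return [f"<could not read {path}: {exc}>"]
    return text.splitlines()[-n:]


def build_report(job_log_path, n=50):
    """Combine kernel/OS details (similar to `uname -a`) with the tail of job.log."""
    u = platform.uname()
    # Same kind of information as `uname -a`, though not byte-identical to it.
    uname_line = f"{u.system} {u.node} {u.release} {u.version} {u.machine}"
    lines = [
        "== uname ==",
        uname_line,
        f"== last {n} lines of {job_log_path} ==",
    ]
    lines.extend(tail_lines(job_log_path, n))
    return "\n".join(lines)
```

Calling `print(build_report("/path/to/your/job/job.log"))` with the real path to the failed job's log prints a report you can paste as-is; the path shown here is only a placeholder.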

jonathanjih · December 20, 2022, 9:53pm

Please see below for our printouts. The example is from a 2D classification, but this error seems to pop up randomly for all job types.

Job type: 2D Classification

Thanks!

wtempel · December 20, 2022, 10:46pm

CryoSPARC v4.1.1 has been released and includes a fix for this problem.

jonathanjih · December 22, 2022, 9:44pm

@wtempel Thanks for the follow-up!