Hi,
Since my update from 3.2 to 4.1 failed horribly, I decided to do a fresh install of 4.1 using a different port on the same standalone machine.
Even after the fresh install, 4.1 is producing weird failures that never happened on the same standalone server with 3.2. For example, when I try to run ‘patch motioncor’, it fails without any error message.
We are seeing this sporadically on our CentOS system. Often a job will fail this way, but then run correctly on restart. Weird. It happens on non-GPU jobs too (e.g. import volumes), so presumably it is not a CUDA issue.
I’m on Rocky Linux, and restarting does not help either. The jobs that run are mostly refinements, import jobs, and CPU extract. The jobs that fail or do not run are patch motioncor and extract with GPU. That’s all I could test with 4.1.
I’m not sure whether I should roll back to 3.2 given the situation!
We are seeing this same issue across several standalone nodes as well. Updated from v4.0.3 to v4.1.0 on both Scientific Linux 7 and Rocky Linux 8.
Possibly related: the Extensive Workflow benchmark test now also fails spontaneously at random jobs with the same “job process terminated abnormally” error. I usually run a full benchmark on each server after any update, but so far I have been unable to finish a single run with v4.1.0.
Thanks for confirming. The only way I found to let the work continue was to roll back to 4.0, which is working great.
Hopefully this gets resolved soon. I really wanted to give the flex-refinement a try.
Thanks.
We are aware of (and in the process of fixing) an issue that:
we have reproduced on CentOS 7
fails to print a meaningful error message to job.log
would have occurred on 4.0.X versions also
You can help us to better define the scope of the problem and to potentially provide a more comprehensive fix by including in your post, if you have not already:
the error message from the event log (with preceding lines for context)