Random worker failures in latest Live version


After updating to 3.1, we have been seeing random, frequent failures of Live workers - they fail with the attached message and then restart themselves. Thoughts?


We see the same thing.

Hi @olibclarke @carthur,

To confirm, when you see this failure happening, does the worker completely restart, or just skip the current movie and go on to the next one?
If it does restart, does it return to processing the same movie and succeed the second time? Or does it continue failing on the same movie and then mark that movie as failed and move on?

For us, what I notice is that I kick off a run with multiple processors (maybe 6), and over time it fails, then restarts, then fails, then restarts, and so on. By the time things stabilize, it is running on 4 of the 6 processors (the others have finally failed and not restarted). It doesn’t seem to be failing on any particular movie.

This is what we see too.

That is strange… just to clarify, processing will eventually succeed on a movie where it previously failed? I just want to absolutely rule out that the files are corrupted…

Do either of you have an example of a movie that always fails?

Another thing you could try: check the job log of one of the workers that ran into this issue, and see if there are any error/warning messages from libtiff.
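
If the log is long, a small script can pull out the relevant lines. This is only a minimal sketch, assuming the job log has been saved as a plain-text file and that libtiff messages contain the string "tiff" somewhere on the line. The function names and the file path are hypothetical, and the pattern may need adjusting to match what your logs actually look like.

```python
import re
import sys
from pathlib import Path

# Assumption: libtiff messages mention "tiff" somewhere on the line.
# Adjust this pattern if your logs use different wording.
TIFF_PATTERN = re.compile(r"tiff", re.IGNORECASE)


def find_tiff_messages(log_text):
    """Return (line_number, line) pairs for lines that mention TIFF/libtiff."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if TIFF_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


def scan_log_file(path):
    """Read a saved job log (hypothetical path) and return its TIFF-related lines."""
    # errors="replace" keeps the scan going if the log contains odd bytes.
    text = Path(path).read_text(errors="replace")
    return find_tiff_messages(text)


# Run as a script with the path to the saved log as the first argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    for number, line in scan_log_file(sys.argv[1]):
        print(f"{number}: {line}")
```

Matching on "tiff" alone is deliberately broad, so expect some unrelated hits (for example, file names ending in .tiff); the goal is just to narrow down where to look.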

Yes, it subsequently starts on the failed movie again and processes it successfully each time. It seems to be random - I don’t think we have a case of a specific movie that always fails. Will check the job log of the worker.