Random worker failures in latest Live version

olibclarke · March 11, 2021, 3:19pm

Hi,

After updating to 3.1, we have been seeing random, frequent failures of Live workers - they fail with the attached message and then restart themselves. Thoughts?

Cheers
Oli

carthur · March 12, 2021, 3:17pm

We see the same thing.

apunjani · March 12, 2021, 3:33pm

Hi @olibclarke @carthur,

To confirm, when you see this failure happening, does the worker completely restart, or just skip the current movie and go on to the next one?
If it does restart, does it return to processing the same movie and succeed the second time? Or does it continue failing on the same movie and then mark that movie as failed and move on?

carthur · March 12, 2021, 4:17pm

For us, what I notice is that I kick off a run with multiple processors (maybe 6), and over time the workers fail, restart, fail, restart, and so on. By the time things stabilize, it is running on 4 of the 6 processors (the others have finally failed and not restarted). It doesn’t seem to be failing on any particular movie.

olibclarke · March 12, 2021, 4:29pm

This is what we see too.

hsnyder · March 12, 2021, 5:41pm

That is strange… just to clarify, processing will eventually succeed on a movie where it previously failed? I just want to absolutely rule out that the files are corrupted…

Do either of you have an example of a movie that always fails?

hsnyder · March 12, 2021, 5:47pm

Another thing you could try: check the job log of one of the workers that ran into this issue, and see if there are any error/warning messages from libtiff.
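
If the job log is long, a small script can pull out the relevant lines. This is only a rough sketch: it assumes the log has been saved as a plain text file, and the pattern (any line mentioning "tiff" together with "error" or "warning") is a guess, not the actual libtiff message format. The function name `find_tiff_messages` is made up for illustration.

```python
import re
import sys
from pathlib import Path

# Assumed pattern: a line that mentions tiff/libtiff and also error/warning,
# in either order. Real libtiff messages may be worded differently.
PATTERN = re.compile(
    r"(libtiff|tiff).*(error|warning)|(error|warning).*(libtiff|tiff)",
    re.IGNORECASE,
)


def find_tiff_messages(log_text):
    """Return (line_number, line) pairs for lines matching PATTERN."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python find_tiff_messages.py <path-to-saved-job-log>")
    # Undecodable bytes are replaced so a stray binary fragment does not abort the scan.
    text = Path(sys.argv[1]).read_text(errors="replace")
    for number, line in find_tiff_messages(text):
        print(f"{number}: {line}")
```

Run it against the saved log of a worker that failed; if nothing prints, widen the pattern or read through the log by hand.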

olibclarke · March 12, 2021, 6:16pm

Yes, it subsequently starts on and succeeds in processing the failed movie each time. It seems to be random - I don’t think we have a case of a specific movie that always fails. Will check the job log of the worker.