2D class average getting stuck

I am running standard 2D classification on multiple GPUs using v2.13.2. The job runs fine at first, but it always gets stuck in the last iteration.

In my most recent attempt, I used 2 GPUs and 40 iterations. The final output before it got stuck was:
“[CPU: 27.72 GB] Start of Iteration 40”
“[CPU: 27.72 GB] – DEV 1 THR 0 NUM 11000 TOTAL 97.084745 ELAPSED 98.070966”

(I tried multiple times, but it always got stuck at this same place, NUM 11000.)

If I switch to a single GPU, the run goes to completion.

Any suggestions as to what the problem might be here?

Thanks,

Pei

This might not be the correct answer, but you may be running out of memory, because the last iteration is much more demanding. How many classes do you have? Try reducing the number of classes and repeating with the same parameters; maybe it will work.
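
One way to check the memory idea is to watch GPU memory while the last iteration runs. The Python sketch below is only an illustration, not something from the software itself. It assumes GPU memory (rather than the CPU RAM shown in the log) is the limit and that `nvidia-smi` is on the PATH. The function names `parse_gpu_memory`, `read_gpu_memory` and `watch` are made up for this example.

```python
import subprocess
import time

# Standard nvidia-smi query: one CSV line per GPU, values in MiB, no header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Turn 'index, used, total' CSV lines into a list of dicts (values in MiB)."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append({"gpu": index, "used_mib": used, "total_mib": total})
    return rows


def read_gpu_memory():
    """Run nvidia-smi once and return the parsed per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


def watch(interval_s=5):
    """Print memory usage for every GPU until interrupted with Ctrl-C."""
    while True:
        for row in read_gpu_memory():
            print(f"GPU {row['gpu']}: {row['used_mib']} / {row['total_mib']} MiB")
        time.sleep(interval_s)
```

Calling `watch()` in a second terminal while the job runs should show whether either GPU gets close to its total memory around the point where iteration 40 stalls.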

Hi @peizhou,

@marino-j is probably right - though why this happens only in the multi-GPU case is a mystery. We will try to replicate this error here but have not seen it so far.