2D class average getting stuck

issue_recorded

#1

I am running a standard 2D classification job on multiple GPUs using v2.13.2. The job runs fine but always gets stuck in the last iteration.

In my most recent attempt, I used 2 GPUs and 40 iterations. The final output before it got stuck was:
“[CPU: 27.72 GB] Start of Iteration 40”
“[CPU: 27.72 GB] – DEV 1 THR 0 NUM 11000 TOTAL 97.084745 ELAPSED 98.070966”

(I tried multiple times, and it always got stuck at this same place, NUM 11000.)

If I switch to a single GPU, the run goes to completion.

Any suggestions as to what the problem might be?

Thanks,

Pei


#2

This might not be the correct answer, but you may be running out of memory, because the last iteration is much more demanding. How many classes do you have? Try reducing the number and repeating with the same parameters; maybe it will work.
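
One way to check the memory guess is to watch GPU memory while the last iteration runs. Below is a minimal sketch, assuming the NVIDIA driver's `nvidia-smi` tool is on the PATH of the worker node; the function names (`parse_gpu_memory`, `query_gpu_memory`, `log_gpu_memory`) are made up for illustration and are not part of cryoSPARC.

```python
import subprocess
import time


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' CSV lines (values in MiB) into dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append({"gpu": index, "used_mib": used, "total_mib": total})
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory use (assumes nvidia-smi is installed)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)


def log_gpu_memory(samples=60, interval_s=5.0):
    """Print memory use for every GPU, `samples` times, `interval_s` apart."""
    for _ in range(samples):
        for row in query_gpu_memory():
            print(f"GPU {row['gpu']}: {row['used_mib']} / {row['total_mib']} MiB")
        time.sleep(interval_s)
```

Calling `log_gpu_memory()` in a second terminal while iteration 40 is running should show whether either GPU gets close to its total memory just before the job stops making progress. Note that this only covers GPU memory; the `[CPU: 27.72 GB]` figure in the log is host memory and would need to be checked separately.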


#3

Hi @peizhou,

@marino-j is probably right, though why this happens only in the multi-GPU case is a mystery. We will try to replicate this error here, but we have not seen it so far.