- J52 and J53 were sent to different lanes and ran on different types of GPUs. The lane J53 was sent to contains all of the GPUs in J52's lane plus additional GPUs; J53 ran on one of those additional GPUs.
- J52 and J53 were run with THP set to [always], but switching to [madvise] did not change the behavior of identical jobs (see the sketch after this list for a quick way to confirm the active THP setting on each worker).
- The default lane nodes are part of the cluster but are not run through the slurm queuing system. We actually just got rid of the default lane entirely to avoid confusion.
- Each 1080Ti GPU is assigned to only one job at a time.
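
For anyone who wants to double-check the THP point above, here is a minimal Python sketch (assuming the standard Linux sysfs path, which is how we verified it) that reports the active THP mode on a node. Run it on each worker, since the setting is per-node and can differ from what the master reports.

```python
#!/usr/bin/env python3
# Minimal sketch: report the active Transparent Huge Pages (THP) mode.
# The sysfs file contains something like "always [madvise] never",
# where the bracketed entry is the mode currently in effect.
from pathlib import Path

THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")

def active_thp_mode() -> str:
    """Return the active THP mode ('always', 'madvise', or 'never')."""
    for mode in THP_PATH.read_text().split():
        if mode.startswith("[") and mode.endswith("]"):
            return mode.strip("[]")
    raise RuntimeError("Could not parse THP mode from sysfs")

if __name__ == "__main__":
    print(f"Active THP mode: {active_thp_mode()}")
```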
It appears, however, that the issue is not with the 1080Ti GPUs themselves but with how one of our workers is set up. If we run the job on a 1080Ti attached to a different worker, it runs fine. But if it is sent to one particular worker (which also hosts the master; this node was the default lane and is also included in the 1080Ti lane), we get the hanging issue, and when we kill the job (or other CS jobs on that worker) the node goes into drain mode (a quick way to inspect the node state is sketched below). I did see another thread on that issue specifically, and my cluster administrator is looking into it, so it may be an issue adjacent to the extraction job rather than caused by it.
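Here is a small Python sketch of how we have been checking whether that worker has gone into drain mode and what reason Slurm recorded. It just wraps `scontrol show node`; the hostname `worker1` is a placeholder, not our actual node name.

```python
#!/usr/bin/env python3
# Minimal sketch: report a Slurm node's state and drain reason (if any).
# Assumes scontrol is on PATH; "worker1" is a placeholder hostname.
import subprocess
import sys

def node_state(node: str) -> tuple[str, str]:
    """Return (state, reason) parsed from `scontrol show node <node>`."""
    out = subprocess.run(
        ["scontrol", "show", "node", node],
        capture_output=True, text=True, check=True,
    ).stdout
    state, reason = "?", ""
    for line in out.splitlines():
        line = line.strip()
        for token in line.split():
            if token.startswith("State="):
                state = token[len("State="):]
        if line.startswith("Reason="):
            reason = line[len("Reason="):]
    return state, reason

if __name__ == "__main__":
    node = sys.argv[1] if len(sys.argv) > 1 else "worker1"
    state, reason = node_state(node)
    print(f"{node}: State={state}")
    if "DRAIN" in state:
        print(f"Drain reason: {reason or 'not recorded'}")
```

Once the cause is understood, an admin can return the node to service with `scontrol update NodeName=<node> State=RESUME`, but we are holding off on that until we know why the kills are triggering the drain.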
I will update if we find anything helpful.