Hi @wtempel
Yes, we are on the same machine: @DerLorenz as the scientific user, me as the operator.
I know that there were no config changes, as we deploy and manage such changes with config management and automation tools.
After the change at the end of May (to the master lock strategy), no errors were observed for a while. Later, the daily number of jobs on the CryoSPARC machine increased, and those errors started to happen again (with the master lock strategy still enabled and no config changes made).
When it happened, I also saw the lock acquisition logging (until we removed the master lock strategy). The last lock/unlock entries were:
2024-07-25 13:33:44,833 job_run_lock INFO | Lock ssd_cache acquired by P150-J2432-1721899456
2024-07-25 13:35:06,275 job_run_unlock INFO | Releasing lock ssd_cache from P150-J2432-1721899456
2024-07-25 13:35:07,994 job_run_lock INFO | Lock ssd_cache acquired by P150-J2433-1721899493
2024-07-25 13:36:25,284 job_run_unlock INFO | Releasing lock ssd_cache from P150-J2433-1721899493
2024-07-25 13:36:26,801 job_run_lock INFO | Lock ssd_cache acquired by P150-J2429-1721899430
2024-07-25 13:37:47,330 job_run_unlock INFO | Releasing lock ssd_cache from P150-J2429-1721899430
2024-07-25 13:37:48,061 job_run_lock INFO | Lock ssd_cache acquired by P150-J2434-1721899791
2024-07-25 13:39:08,770 job_run_unlock INFO | Releasing lock ssd_cache from P150-J2434-1721899791
2024-07-25 13:39:11,449 job_run_lock INFO | Lock ssd_cache acquired by P150-J2435-1721899809
2024-07-25 13:40:32,694 job_run_unlock INFO | Releasing lock ssd_cache from P150-J2435-1721899809
2024-07-25 13:40:34,180 job_run_lock INFO | Lock ssd_cache acquired by P150-J2430-1721899850
2024-07-25 13:43:04,791 job_run_unlock INFO | Releasing lock ssd_cache from P150-J2430-1721899850
2024-07-25 13:43:05,245 job_run_lock INFO | Lock ssd_cache acquired by P150-J2431-1721899943
2024-07-25 13:45:37,425 job_run_unlock INFO | Releasing lock ssd_cache from P150-J2431-1721899943
2024-07-25 13:45:38,175 job_run_lock INFO | Lock ssd_cache acquired by P150-J2433-1721900542
2024-07-25 13:48:07,600 job_run_unlock INFO | Releasing lock ssd_cache from P150-J2433-1721900542
2024-07-25 13:48:07,824 job_run_lock INFO | Lock ssd_cache acquired by P150-J2437-1721900483
2024-07-25 13:50:49,105 job_run_unlock INFO | Releasing lock ssd_cache from P150-J2437-1721900483
2024-07-25 13:50:49,246 job_run_lock INFO | Lock ssd_cache acquired by P150-J2432-1721900444
2024-07-25 13:53:22,013 job_run_unlock INFO | Releasing lock ssd_cache from P150-J2432-1721900444
2024-07-25 13:53:24,564 job_run_lock INFO | Lock ssd_cache acquired by P150-J2436-1721899836
2024-07-25 13:55:57,209 job_run_unlock INFO | Releasing lock ssd_cache from P150-J2436-1721899836
2024-07-25 13:55:58,582 job_run_lock INFO | Lock ssd_cache acquired by P150-J2429-1721900660
2024-07-25 13:57:18,687 job_run_unlock INFO | Releasing lock ssd_cache from P150-J2429-1721900660
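In case it is useful, here is a minimal sketch that pairs up those acquire/release lines and prints how long each job held the ssd_cache lock. It assumes only the log format visible in the excerpt above; the regex and field names are my own, not CryoSPARC internals.

```python
import re
import sys
from datetime import datetime

# Matches the lock/unlock lines shown above; the format is an assumption
# based on this excerpt, not on CryoSPARC's actual logging code.
PATTERN = re.compile(
    r"^(?P<ts>[\d-]+ [\d:,]+) job_run_(?P<event>lock|unlock) INFO \| "
    r".*?(?P<holder>P\d+-J\d+-\d+)$"
)

def lock_hold_times(lines):
    """Yield (holder, seconds held) for each acquire/release pair."""
    acquired = {}
    for line in lines:
        m = PATTERN.match(line.strip())
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f")
        holder = m["holder"]
        if m["event"] == "lock":
            acquired[holder] = ts
        elif holder in acquired:
            yield holder, (ts - acquired.pop(holder)).total_seconds()

if __name__ == "__main__":
    for holder, seconds in lock_hold_times(sys.stdin):
        print(f"{holder}\t{seconds:.1f}s")
```

Going by the excerpt, each job held the lock for roughly 1 to 2.5 minutes, and the lock changed hands serially, with only a second or two between a release and the next acquisition.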
I also remember seeing polling for the lock in the client logs when it was not immediately available.
Since we have gone back to the per-job SSD cache (i.e. a per-job /tmp that is removed after the job), the previously seen job failures described above are gone.
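Purely as an illustration of that lifecycle (our actual per-job /tmp is provided by the scheduler, not by a wrapper like this), the pattern is essentially:

```python
import os
import subprocess
import tempfile

# Illustrative sketch only: create a private scratch directory for one job,
# point the job at it, and let it be removed automatically afterwards.
def run_with_private_scratch(cmd):
    with tempfile.TemporaryDirectory(prefix="job_cache_") as scratch:
        env = {**os.environ, "TMPDIR": scratch}  # job sees only its own scratch
        subprocess.run(cmd, env=env, check=True)
    # the directory and its contents are gone once the job exits
```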
The CryoSPARC version we're running is v4.5.1, since the end of May; we did the update to v4.5.1 and the cache reconfiguration at the same time.