Error in Extensive Validation after v4.4 update

hna · December 1, 2023, 9:08pm

I updated CryoSPARC to v4.4 and tried to run the Extensive Validation with the empiar_10025_subset data.
However, I got this error after 17 successful jobs:

===========
[CPU: 180.4 MB] Launching job: abinit

Creating job abinit, (homo_abinit)
Scheduling homo_abinit (homo_abinit) P10, J81

**** Kill signal sent by CryoSPARC (ID: ) ****

Job is unresponsive - no heartbeat received in 1800 seconds.

===============

However, the job (P10, J81) completed without issue.

I found that no new jobs were created or submitted after the job (P10, J81); the workflow simply waited and then failed as described above.
The job (P10, J81) is an Ab-initio job, and I think the next job should be Homo Refine, but the Homo Refine job was never created.

(I set
export CRYOSPARC_HEARTBEAT_SECONDS=1800
so that I could observe what happens after the job (P10, J81) completes.)

I tried 4 times, and 3 times I got exactly the same error after exactly 17 successful jobs.
In the remaining case, it got to 19 jobs before failing with the same error.

The extensive validation tests were just fine before the update.
Any comments/suggestions would be very helpful.

CryoSPARC is installed on a cluster.
Thanks!

-Heechang

wtempel · December 1, 2023, 9:57pm

Hi @hna. Could you please

- post a table view of the workspace that shows the Elapsed times for the jobs in the validation workflow
- email us the job report for J81

hna · December 5, 2023, 2:49pm

Hi @wtempel ,

Thank you so much for the response!
Here are the screenshots of the failed Validation tasks.
As you can see, the only errors are on J64.

I sent an email with the job report for J81.
Please let me know if there is anything I can do further.
Thank you so much!

-Heechang

wtempel · December 5, 2023, 3:12pm

Thanks @hna. Please can you email us the job report for J64 also.

hna · December 5, 2023, 3:25pm

Just sent the job report for J64. Thank you so much!! -Heechang

wtempel · December 5, 2023, 3:44pm

This may include just the info needed. It seems that SLURM terminated J64’s processes when J64 hit the limit specified by the
#SBATCH --time= directive inside the cluster script template.
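
For reference, a minimal sketch of what such a directive looks like in the header of a SLURM batch script. The 24:00:00 value is an assumption for illustration only, not taken from this thread; the right limit depends on the cluster's policy and on how long the longest job in the validation workflow runs.

```shell
#!/usr/bin/env bash
# Sketch of a SLURM batch script header (assumed layout, not the actual template from this thread).
# SLURM terminates the job once its run time reaches the --time limit.
# Accepted formats include minutes, HH:MM:SS and D-HH:MM:SS.
# The value below is hypothetical.
#SBATCH --time=24:00:00
```

Whether a job was stopped for this reason can usually be confirmed with `sacct -j <slurm_job_id> --format=JobID,State,Elapsed,Timelimit`, where a job that hit the limit is reported with state TIMEOUT.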

hna · December 5, 2023, 4:20pm

Thank you so much for the suggestion!
Yes, it seems that would cause the issue.
I will fix it and test it soon.
Thanks again!

-Heechang

hna · December 5, 2023, 7:10pm

@wtempel I confirmed it worked!

What I found was that the first job of the Extensive Validation previously ran on the host, but with v4.4 it runs on a compute node through SLURM.
That is why it worked before: there was no wall time limit on the host.
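
A quick way to see how close a job came to the wall time limit is to convert SLURM time strings (such as the Elapsed and Timelimit columns printed by `sacct`) to seconds and compare them. The bash sketch below is only an illustration: `slurm_time_to_seconds` is a hypothetical helper name, not part of SLURM or CryoSPARC, the example values are made up, and non-numeric limits such as UNLIMITED are rejected rather than handled.

```shell
#!/usr/bin/env bash

# Hypothetical helper: convert a SLURM time string to seconds.
# Handles MM, MM:SS, HH:MM:SS, D-HH, D-HH:MM and D-HH:MM:SS.
slurm_time_to_seconds() {
    local spec="$1" days=0 h=0 m=0 s=0 a b c
    local re='^([0-9]+-)?[0-9]+(:[0-9]+){0,2}$'
    # Reject non-numeric values such as UNLIMITED.
    [[ "$spec" =~ $re ]] || return 1
    if [[ "$spec" == *-* ]]; then
        # With a day prefix, the remaining fields are hours[:minutes[:seconds]].
        days="${spec%%-*}"
        IFS=: read -r h m s <<< "${spec#*-}"
    else
        IFS=: read -r a b c <<< "$spec"
        if [[ -n "$c" ]]; then h="$a"; m="$b"; s="$c"   # HH:MM:SS
        elif [[ -n "$b" ]]; then m="$a"; s="$b"         # MM:SS
        else m="$a"                                     # MM
        fi
    fi
    # 10# forces base 10 so that fields like "08" are not read as octal.
    echo $(( 10#$days * 86400 + 10#${h:-0} * 3600 + 10#${m:-0} * 60 + 10#${s:-0} ))
}

# Example with made-up values: one hour elapsed against a one-day limit.
elapsed=$(slurm_time_to_seconds "01:00:00")
limit=$(slurm_time_to_seconds "1-00:00:00")
if (( elapsed >= limit )); then
    echo "job reached its time limit"
else
    echo "job finished within its time limit"
fi
```

On the host there is no such limit to compare against, which matches the behavior seen before the update.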

I also found that many tasks now run on compute nodes rather than on the host.
Previously, there were out-of-memory issues on the host because some of the tasks running there were too big.
Hopefully, these bigger tasks are now running on the compute nodes.

Anyway, this resolved the issue I had!
Thank you so much for the help!!

-Heechang