Hi all,
I am pasting below the event log of my 3D Flex Training job, which ends in the error “AssertionError: Dataset size is not divisible by batch size.”
Here is some extra info that might help in troubleshooting. I’m not sure which pieces are relevant, but I’m hoping someone familiar with the 3D Flex code will find some useful clues somewhere in here:
- The number of particles I am loading into the job is 1,609,000 (from the “Prepared particles” output of 3D Flex Data Prep). Note that this number is not divisible by what I interpret as the number of batches in the event log (322): 1,609,000 / 322 ≈ 4,996.9. So I don’t know what the batch size actually is. Also note that 1,609,000 has a large prime factor (1,609,000 = 2³ × 5³ × 1,609, and 1,609 is prime).
- To the same Prepared Particles input object, I connected the three components (components_mode_0, components_mode_1, components_mode_2) from a 3DVA job, and I set the relevant 3D Flex Training parameters so that these components initialize the latents (Initialize latents from input: True; Initialize latents input indices: 0,1,2).
- If I do not use latent initialization, the job runs fine—no error.
- The 3DVA and 3D Flex Data Prep jobs were run on the same stack of particles, which contained 1,621,478 particles.
- The 3DVA job produced two particle outputs: “Particles” (1,621,453) and “Rejected particles” (25). These sum to the 1,621,478 input particles.
- After removing particles with low scale factors and rounding down to the nearest 1000, the 3D Flex Data Prep job produced an output with 1,609,000 particles.
- 1,608,976 particles from the 3D Flex Data Prep output were present in the “Particles” output of the 3DVA job, and 24 were present in the “Rejected particles” output. (1,608,976 + 24 = 1,609,000, so every Data Prep particle is accounted for.)
- I suspected that the problem was that 3D Flex Training doesn’t know what to do with the 24 particles to which 3DVA never assigned latent coordinates. To test this, I created a new particle stack from the 1,608,976-particle intersection mentioned above, reduced to 1,608,000 particles (the nearest multiple of 1,000), and used it as the input to a new 3D Flex Training job with latent initialization. This ran without errors. (See the scripting sketch just after this list.)
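In case it helps anyone reproduce the bookkeeping, here is roughly how the intersection and truncation can be scripted with cryosparc-tools. The connection details, the 3DVA job ID (“JXX”), and the output-group names are placeholders that would need adjusting for a real project; J159 is my Flex Data Prep job (it shows up in the log below).

```python
import numpy as np
from cryosparc.tools import CryoSPARC

# Connection details are placeholders for this sketch.
cs = CryoSPARC(
    license="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    host="localhost",
    base_port=39000,
    email="user@example.com",
    password="password",
)
project = cs.find_project("P266")

# J159 is my Flex Data Prep job; "JXX" stands in for the 3DVA job.
# Output-group names may differ, so check each job's Outputs tab.
prep = project.find_job("J159").load_output("particles")
va_ok = project.find_job("JXX").load_output("particles")
va_rej = project.find_job("JXX").load_output("particles_rejected")

# Count how many prepared particles fall in each 3DVA output.
in_ok = np.isin(prep["uid"], va_ok["uid"])
in_rej = np.isin(prep["uid"], va_rej["uid"])
print(in_ok.sum(), in_rej.sum())  # 1,608,976 and 24 in my case

# Keep only the intersection, truncated to the nearest multiple of
# 1,000 to mimic Data Prep's own rounding.
n_keep = int(in_ok.sum()) // 1000 * 1000  # 1,608,000
subset = prep.mask(in_ok).slice(0, n_keep)

# The subset could then be saved back to the project (e.g. with
# project.save_external_result) and connected to a new training job.
```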
Please let me know if I diagnosed the problem correctly. If so, here are some ideas for what 3D Flex Training could do in this scenario:
- It looks like 3DVA does actually write out latent coordinates for the rejected particles, just as a separate output object. Could the latent coordinates of the accepted and rejected particles be bundled into a single object, so that everything can be connected to the Prepared Particles input of the 3D Flex Training job?
- Initialize the latent coordinates of those 24 particles as (0,0,0).
3DVA is probably telling me that I don’t want those particles in my dataset anyway, so:
- Have 3D Flex Training automatically do what I did manually: throw out the 24 particles rejected by 3DVA, then round down to the nearest 1,000.
- Even better: have 3D Flex Data Prep also save a final partial batch of Prepared Particles containing the leftover (particle count % 1000) particles. 3D Flex Training could then use these extras to pad the holes left by rejected-particle dropout. With that approach, I’d get 1,609,000 instead of 1,608,000 particles in my final 3D Flex Training job. (A toy sketch of this padding logic follows this list.)
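To make that last proposal concrete, here is a toy numpy sketch of the padding logic. This is purely hypothetical behavior I’m proposing, not anything CryoSPARC currently does, and the function and variable names are made up:

```python
import numpy as np

def pad_rejected(prep_uids, rejected_uids, spare_uids):
    """Swap spare particles in for 3DVA-rejected ones so the total
    stays at the rounded count. Hypothetical helper illustrating the
    proposal above, not existing CryoSPARC behavior."""
    holes = np.isin(prep_uids, rejected_uids)
    n_holes = int(holes.sum())            # 24 in my case
    if n_holes > len(spare_uids):
        raise ValueError("not enough spare particles to fill the holes")
    kept = prep_uids[~holes]              # 1,608,976 in my case
    return np.concatenate([kept, spare_uids[:n_holes]])  # 1,609,000 again

# e.g. padded = pad_rejected(prep["uid"], va_rej["uid"], spares["uid"]);
# len(padded) stays a multiple of 1,000 whenever the prepared count was.
```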
Of course, I’m guessing that I could also solve the problem by re-running 3D Flex Data Prep on the accepted Particles output of 3DVA, but that would take a long time, and it seems wasteful to have to do that every time I run a new 3DVA job.
Any other advice would be appreciated—just want to make sure I’m not misunderstanding something. Thank you!
-Josh
[CPU: 1.25 GB Avail: 21.97 GB]
License is valid.
[CPU: 1.25 GB Avail: 21.97 GB]
Launching job on lane REDACTED …
[CPU: 1.25 GB Avail: 21.97 GB]
Running job on remote worker node hostname REDACTED
[CPU: 148.5 MB Avail: 457.37 GB]
Job P266-J198 started
[CPU: 148.5 MB Avail: 457.37 GB]
Master running v5.0.3, worker running v5.0.3
[CPU: 150.5 MB Avail: 457.66 GB]
Working in directory: REDACTED
[CPU: 150.5 MB Avail: 457.67 GB]
Running on lane REDACTED
[CPU: 150.5 MB Avail: 457.69 GB]
Resources Allocated
-----------  -------------------------
Worker       REDACTED
CPU          [0, 1, 2, 3]
GPU          [0]
RAM          [0, 1, 2, 3, 4, 5, 9, 10]
SSD          False
[CPU: 150.5 MB Avail: 457.70 GB]
──────────────────────────────────────────────────────────────
[CPU: 150.5 MB Avail: 457.70 GB]
Importing job module for job type flex_train…
[CPU: 720.6 MB Avail: 457.18 GB]
Job ready to run
[CPU: 720.6 MB Avail: 457.18 GB]
──────────────────────────────────────────────────────────────
[CPU: 1.26 GB Avail: 456.76 GB]
====== 3D Flex Training Model Setup =======
[CPU: 1.26 GB Avail: 456.76 GB]
Loading mesh…
Input tetramesh
REDACTED
[CPU: 1.31 GB Avail: 456.61 GB]
Input particles already have associated latent coordinates.
[CPU: 1.31 GB Avail: 456.61 GB]
Parameter “Initialize latents from input” was set. Initializing latent coordinates from input.
[CPU: 1.31 GB Avail: 456.61 GB]
“Initialize latents input indices” was specified. Using input components [0, 1, 2]
[CPU: 1.41 GB Avail: 456.52 GB]
Reading in all particle data…
[CPU: 1.41 GB Avail: 456.52 GB]
Reading file 322 of 322 (J159/J159_particles_train_batch_00321.mrc)
[CPU: 102.75 GB Avail: 252.35 GB]
Reading in all particle CTF data…
[CPU: 102.75 GB Avail: 252.37 GB]
Reading file 322 of 322 (J159/J159_particles_train_batch_00321_ctf.mrc)
[CPU: 153.82 GB Avail: 149.31 GB]
Setting up particle poses..
[CPU: 153.88 GB Avail: 149.26 GB]
Initializing torch..
[CPU: 153.98 GB Avail: 148.96 GB]
====== Test reconstruction with zero deformation =======
[CPU: 153.95 GB Avail: 148.94 GB]
Traceback (most recent call last):
  File "cli/run.py", line 106, in cli.run.run_job
  File "cli/run.py", line 211, in cli.run.run_job_function
  File "compute/jobs/flex_refine/run_train.py", line 177, in compute.jobs.flex_refine.run_train.run
  File "compute/jobs/flex_refine/flexmod.py", line 373, in compute.jobs.flex_refine.flexmod.run_test_density_opt
AssertionError: Dataset size is not divisible by batch size