Found non-finite values while processing

I’m getting a weird error when trying to process movies:

“Found non-finite values while processing”

Found non-finite values while processing J36/imported/001036135298739210565_20241019_V_<redacted>_1_Oct19_17.18.53.tif: {'micrograph_blob/vmax', 'micrograph_thumbnail_blob_1x/vmax', 'micrograph_thumbnail_blob_2x/vmax', 'micrograph_thumbnail_blob_1x/vmin', 'micrograph_thumbnail_blob_2x/vmin', 'micrograph_blob_non_dw/vmin', 'micrograph_blob_non_dw/vmax', 'micrograph_blob/vmin'}

All the micrographs look like this:

We are using Rocky Linux and just updated to 4.6 from the previous (4.5.x), and this is the first attempt to use 4.6. I noticed that on a different system, also Rocky Linux/4.6, this exact data is working just fine.

Not even really sure how to go about debugging this.

@ccgauvin94 Please can you post the outputs of these commands:

csprojectid=P99 # replace with actual project ID
csjobid=J199 # replace with id of failed job
cryosparcm joblog $csprojectid $csjobid | tail -n 40
cryosparcm eventlog $csprojectid $csjobid | tail -n 40
cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'instance_information', 'status',  'params_spec', 'errors_run')"

In order:

========= sending heartbeat at 2024-10-24 20:36:09.230115
========= sending heartbeat at 2024-10-24 20:36:19.251514
========= sending heartbeat at 2024-10-24 20:36:29.275511
========= sending heartbeat at 2024-10-24 20:36:39.302313
========= sending heartbeat at 2024-10-24 20:36:49.324273
========= sending heartbeat at 2024-10-24 20:36:59.347919
========= sending heartbeat at 2024-10-24 20:37:09.373762
========= sending heartbeat at 2024-10-24 20:37:19.397559
========= sending heartbeat at 2024-10-24 20:37:29.421677
========= sending heartbeat at 2024-10-24 20:37:39.445802
========= sending heartbeat at 2024-10-24 20:37:49.468913
========= sending heartbeat at 2024-10-24 20:37:59.490715
========= sending heartbeat at 2024-10-24 20:38:09.516011
========= sending heartbeat at 2024-10-24 20:38:19.533646
========= sending heartbeat at 2024-10-24 20:38:29.558678
========= sending heartbeat at 2024-10-24 20:38:39.580403
========= sending heartbeat at 2024-10-24 20:38:49.606448
========= sending heartbeat at 2024-10-24 20:38:59.629306
========= sending heartbeat at 2024-10-24 20:39:09.651614
========= sending heartbeat at 2024-10-24 20:39:19.674810
========= sending heartbeat at 2024-10-24 20:39:29.697130
========= sending heartbeat at 2024-10-24 20:39:39.718593
========= sending heartbeat at 2024-10-24 20:39:49.744227
========= sending heartbeat at 2024-10-24 20:39:59.763384
========= sending heartbeat at 2024-10-24 20:40:09.787395
========= sending heartbeat at 2024-10-24 20:40:19.813491
========= sending heartbeat at 2024-10-24 20:40:29.833620
========= sending heartbeat at 2024-10-24 20:40:39.859105
========= sending heartbeat at 2024-10-24 20:40:49.882414
========= sending heartbeat at 2024-10-24 20:40:59.907405
========= sending heartbeat at 2024-10-24 20:41:09.923563
========= sending heartbeat at 2024-10-24 20:41:19.948342
========= sending heartbeat at 2024-10-24 20:41:29.967682
========= sending heartbeat at 2024-10-24 20:41:39.989454
========= sending heartbeat at 2024-10-24 20:41:50.003560
========= sending heartbeat at 2024-10-24 20:42:00.017711
========= sending heartbeat at 2024-10-24 20:42:10.046034
========= sending heartbeat at 2024-10-24 20:42:20.067784
========= sending heartbeat at 2024-10-24 20:42:30.088599
        Done in 0.01s
        Writing 240x240 micrograph thumbnail to J37/thumbnails/011353936749481604755_20241019_V_redacted_V2258242472_1_Oct19_17.35.10_thumb_@2x.png ...
        Done in 0.01s
        Writing dose-weighted result to J37/motioncorrected/011353936749481604755_20241019_V_redacted_V2258242472_1_Oct19_17.35.10_patch_aligned_doseweighted.mrc ...
        Done in 0.14s
        Writing background estimate to J37/motioncorrected/011353936749481604755_20241019_V_redacted_V2258242472_1_Oct19_17.35.10_background.mrc ...
        Done in 0.01s
        Writing motion estimates...
        Done in 0.01s
[Thu, 24 Oct 2024 20:41:32 GMT] [CPU RAM used: 1870 MB] -- 3.0: processing 120 of 7720: J36/imported/002604789264468600263_20241019_V_redacted_V2258242472_1_Oct19_17.35.15.tif
        loading /cryoem_share/VALARA_redacted_GS_SUN_V2258242472_10202024/CS-valara-redacted-gs-v2258242472-1/J36/imported/002604789264468600263_20241019_V_redacted_V2258242472_1_Oct19_17.35.15.tif
        Loading raw movie data from J36/imported/002604789264468600263_20241019_V_redacted_V2258242472_1_Oct19_17.35.15.tif ...
        Done in 2.16s
        Loading gain data from J36/imported/CountRef_20241019_V_redacted_V2258242472_1_Oct19_17.15.11.mrc ...
        Done in 0.00s
        Processing ...
[Thu, 24 Oct 2024 20:41:32 GMT] [CPU RAM used: 420 MB] Found non-finite values while processing J36/imported/017927616592212853050_20241019_V_redacted_V2258242472_1_Oct19_17.33.51.tif: {'microg
raph_blob/vmax', 'micrograph_thumbnail_blob_1x/vmax', 'micrograph_thumbnail_blob_2x/vmax', 'micrograph_thumbnail_blob_1x/vmin', 'micrograph_thumbnail_blob_2x/vmin', 'micrograph_blob_non_dw/
vmin', 'micrograph_blob_non_dw/vmax', 'micrograph_blob/vmin'}
Marking as incomplete and continuing...
[Thu, 24 Oct 2024 20:42:26 GMT] [CPU RAM used: 1886 MB] -- 2.0: processing 121 of 7720: J36/imported/002081489210071029544_20241019_V_redacted_V2258242472_1_Oct19_17.35.20.tif
        loading /cryoem_share/VALARA_redacted_GS_SUN_V2258242472_10202024/CS-valara-redacted-gs-v2258242472-1/J36/imported/002081489210071029544_20241019_V_redacted_V2258242472_1_Oct19_17.35.20.tif
        Loading raw movie data from J36/imported/002081489210071029544_20241019_V_redacted_V2258242472_1_Oct19_17.35.20.tif ...
        Done in 2.16s
        Loading gain data from J36/imported/CountRef_20241019_V_redacted_V2258242472_1_Oct19_17.15.11.mrc ...
        Done in 0.00s
        Processing ...
[Thu, 24 Oct 2024 20:42:30 GMT] [CPU RAM used: 1914 MB] -- 0.0: processing 122 of 7720: J36/imported/002570983219471162082_20241019_V_redacted_V2258242472_1_Oct19_17.35.26.tif
        loading /cryoem_share/VALARA_redacted_GS_SUN_V2258242472_10202024/CS-valara-redacted-gs-v2258242472-1/J36/imported/002570983219471162082_20241019_V_redacted_V2258242472_1_Oct19_17.35.26.tif
        Loading raw movie data from J36/imported/002570983219471162082_20241019_V_redacted_V2258242472_1_Oct19_17.35.26.tif ...
        Done in 2.16s
        Loading gain data from J36/imported/CountRef_20241019_V_redacted_V2258242472_1_Oct19_17.15.11.mrc ...
        Done in 0.00s
        Processing ...
[Thu, 24 Oct 2024 20:42:32 GMT] [CPU RAM used: 1890 MB] -- 1.0: processing 123 of 7720: J36/imported/008373004332207425256_20241019_V_redacted_V2258242472_1_Oct19_17.35.31.tif
        loading /cryoem_share/VALARA_redacted_GS_SUN_V2258242472_10202024/CS-valara-redacted-gs-v2258242472-1/J36/imported/008373004332207425256_20241019_V_redacted_V2258242472_1_Oct19_17.35.31.tif
        Loading raw movie data from J36/imported/008373004332207425256_20241019_V_redacted_V2258242472_1_Oct19_17.35.31.tif ...
        Done in 2.15s
        Loading gain data from J36/imported/CountRef_20241019_V_redacted_V2258242472_1_Oct19_17.15.11.mrc ...
        Done in 0.00s
        Processing ...
{'_id': '671aa594940f4916dfb0c2e0', 'errors_run': [], 'instance_information': {'CUDA_version': '11.8', 'available_memory': '201.06GB', 'cpu_model': 'Intel(R) Xeon(R) CPU @ 2.30GHz', 'driver
_version': '12.0', 'gpu_info': [{'id': 0, 'mem': 17066622976, 'name': 'Tesla P100-PCIE-16GB', 'pcie': '0000:00:04'}, {'id': 1, 'mem': 17066622976, 'name': 'Tesla P100-PCIE-16GB', 'pcie': '0
000:00:05'}, {'id': 2, 'mem': 17066622976, 'name': 'Tesla P100-PCIE-16GB', 'pcie': '0000:00:06'}, {'id': 3, 'mem': 17066622976, 'name': 'Tesla P100-PCIE-16GB', 'pcie': '0000:00:07'}], 'ofd_
hard_limit': 65536, 'ofd_soft_limit': 65536, 'physical_cores': 16, 'platform_architecture': 'x86_64', 'platform_node': 'pipeline-34368', 'platform_release': '5.7.7-1.el7.elrepo.x86_64', 'pl
atform_version': '#1 SMP Wed Jul 1 11:53:16 EDT 2020', 'total_memory': '204.47GB', 'used_memory': '1.70GB'}, 'job_type': 'patch_motion_correction_multi', 'params_spec': {'compute_num_gpus':
 {'value': 4}, 'output_f16': {'value': True}}, 'project_uid': 'P23', 'status': 'killed', 'uid': 'J37', 'version': 'v4.6.0'}

Hi @ccgauvin94,

I noticed that on a different system, also Rocky Linux/4.6, this exact data is working just fine.

Interesting. If you clone and run the problematic job on the same system, does the error happen again?

– Harris

Hi Harris,

That doesn’t seem to make a difference. Still get the same gray micrographs and same error.

Hi @ccgauvin94,

Are the two computers part of the same cryosparc installation? Is the same filesystem mounted on both systems, and the same projects visible? Or are they separate installations, just the same movies?

In either case, assuming it’s not literally a clone of the same job that fails on one system and succeeds on the other, could you check that the other inputs to the import movies job are the same? defect file, gain reference, parameters, etc

– Harris

These are separate installations, separate filesystems. Just copied the data to each to check if the was the data. All other inputs are identical between the two jobs. File sizes also appear to be identical.

I am doing a bit of digging and noticed something unusual. I cloned a job that ran on 4.5.x successfully, and noticed while the job doesn’t fail, it no longer gives close to the same output:

Here is the job that ran on 4.5.x:

And here is that exact same job, just cloned on 4.6:

image

Here is the job log for the 3D class that worked:

Here is the job log for the cloned 3D class that seems to have issues:

One more tidbit re: motion correction job.

If I clone the job that isn’t working, and set it to 1 GPU (instead of 2 or 4), the error goes away. I still get gray micrographs though. But the “non-finite values” error in red goes away.

I’m noticing that this behavior is quite inconsistent. Sometimes, the job runs just fine, usually only with a single GPU. Other times, even with a single GPU it doesn’t run OK.

Sometimes, I get the gray micrograph. Other times, I get a strange fuzziness at the top of the micrograph.

image

I’m also noticing that even when patch motion works fine, 2D classification (even on a single GPU) will not. It thinks that the classes are immediately converged, and the noise model gets very strange:

image

Like it’s not actually classifying them.

At this point, I’ve reproduced this behavior across several datasets, including old datasets that worked just fine on existing hardware.

For instance, here is another 3D classification job processed on this system a while ago. This is the 3D consensus from the very beginning of the job:

And here is the same job, cloned today, same 3D consensus:

This is a cloud system, so I don’t think that anything is broken hardware wise, and to my knowledge, there were no config changes or driver updates between working 4.5.x and 4.6.

So we reinstalled CryoSPARC onto a node using L4 GPUs instead of the P100s we were using. that seems like it may have fixed the “Found on-finite values while processing”, but I’m still getting strange behavior in CryoSPARC Live:

Where the thumbnail looks good, but aligned micrograph and CTF images appear to be broken. However, the motion and CTF correctino look good, so I suspect it’s just something with the display?