Trap invalid opcode in cryosparc_io.so

Can someone help me understand why cryosparc_io.so is hitting invalid opcodes and occasionally killing jobs? I have seen 21 instances of this so far this month:

01/all.gz:Oct  1 03:08:44 ga4 kernel: [3596257.176981] traps: python[237937] trap invalid opcode ip:2b47bae0feae sp:2b48d5a4c2c0 error:0 in cryosparc_io.so[2b47bae08000+2d000]
03/all.gz:Oct  3 09:19:49 ga7 kernel: [3791297.061444] traps: python[73443] trap invalid opcode ip:2b49fe804eae sp:2f43972622c0 error:0 in cryosparc_io.so[2b49fe7fd000+2d000]
03/all.gz:Oct  3 09:32:14 ga3 kernel: [3791913.314626] traps: python[211824] trap invalid opcode ip:2b1d1a86deae sp:2b1ea79a82c0 error:0 in cryosparc_io.so[2b1d1a866000+2d000]
03/all.gz:Oct  3 09:48:39 ga7 kernel: [3793026.671240] traps: python[75286] trap invalid opcode ip:2b6815cb55ca sp:2b69175394f0 error:0 in cryosparc_io.so[2b6815cae000+2d000]
03/all.gz:Oct  3 10:48:49 ga7 kernel: [3796636.763563] traps: python[77537] trap invalid opcode ip:2af653ba8eae sp:2af7551832c0 error:0 in cryosparc_io.so[2af653ba1000+2d000]
03/all.gz:Oct  3 12:02:11 ga19 kernel: [699512.990736] traps: python[86418] trap invalid opcode ip:2b8299c87eae sp:2b839b3372c0 error:0 in cryosparc_io.so[2b8299c80000+2d000]
03/all.gz:Oct  3 12:54:39 ga20 kernel: [3804078.859481] traps: python[102163] trap invalid opcode ip:2af4189c0eae sp:2af51be742c0 error:0 in cryosparc_io.so[2af4189b9000+2d000]
03/all.gz:Oct  3 13:47:45 ga17 kernel: [3807020.209252] traps: python[221885] trap invalid opcode ip:2b75a3c31eae sp:2b76a52832c0 error:0 in cryosparc_io.so[2b75a3c2a000+2d000]
03/all.gz:Oct  3 13:49:49 ga7 kernel: [3807497.504271] traps: python[87696] trap invalid opcode ip:2b950a51beae sp:2b960fd082c0 error:0 in cryosparc_io.so[2b950a514000+2d000]
03/all.gz:Oct  3 14:25:40 ga7 kernel: [3809648.975726] traps: python[89880] trap invalid opcode ip:2b9496f18eae sp:2b95aa6832c0 error:0 in cryosparc_io.so[2b9496f11000+2d000]
03/all.gz:Oct  3 14:27:13 ga17 kernel: [3809387.909975] traps: python[224094] trap invalid opcode ip:2b1fcbe99eae sp:2b20cfa8f2c0 error:0 in cryosparc_io.so[2b1fcbe92000+2d000]
03/all.gz:Oct  3 19:29:02 ga20 kernel: [3827741.776435] traps: python[121895] trap invalid opcode ip:2b31364f6eae sp:2b31e23102c0 error:0 in cryosparc_io.so[2b31364ef000+2d000]
03/all.gz:Oct  3 21:07:02 ga17 kernel: [3833376.526132] traps: python[246862] trap invalid opcode ip:2abf7eeb6eae sp:2ac029f7b2c0 error:0 in cryosparc_io.so[2abf7eeaf000+2d000]
04/all.gz:Oct  4 04:47:03 ga13 kernel: [3861134.840269] traps: python[26755] trap invalid opcode ip:2b37bfcf8eae sp:2b3853aa22c0 error:0 in cryosparc_io.so[2b37bfcf1000+2d000]
04/all.gz:Oct  4 04:55:12 ga8 kernel: [3861252.443258] traps: python[112030] trap invalid opcode ip:2b7822a35eae sp:2b7927fc22c0 error:0 in cryosparc_io.so[2b7822a2e000+2d000]
04/all.gz:Oct  4 05:33:20 ga17 kernel: [3863753.473876] traps: python[11436] trap invalid opcode ip:2b565bb0beae sp:2b57f56632c0 error:0 in cryosparc_io.so[2b565bb04000+2d000]
04/all.gz:Oct  4 05:54:45 ga8 kernel: [3864824.464155] traps: python[115475] trap invalid opcode ip:2b084b7a9eae sp:2b094fca12c0 error:0 in cryosparc_io.so[2b084b7a2000+2d000]
04/all.gz:Oct  4 17:06:53 ga11 kernel: [3905613.019307] traps: python[56718] trap invalid opcode ip:2ac88c096eae sp:2ac99f2a32c0 error:0 in cryosparc_io.so[2ac88c08f000+2d000]
05/all.gz:Oct  5 07:53:27 ga17 kernel: [3958556.529031] traps: python[95890] trap invalid opcode ip:2ba820dbdeae sp:2ba91d2142c0 error:0 in cryosparc_io.so[2ba820db6000+2d000]
05/all.gz:Oct  5 11:18:16 ga7 kernel: [3971213.810024] traps: python[235868] trap invalid opcode ip:2af346f7eeae sp:2af4856082c0 error:0 in cryosparc_io.so[2af346f77000+2d000]
06/all.gz:Oct  6 01:49:52 ga8 kernel: [4022916.264838] traps: python[1371] trap invalid opcode ip:2ab2fff435ca sp:2ab52caee8b0 error:0 in cryosparc_io.so[2ab2fff3c000+2d000]

I have also seen some logs in another file:

traps: python[19125] trap invalid opcode ip:2ac6e4a92b64 sp:7fff57bee120 error:0 in bin_motion.so[2ac6e4a63000+44000]

I am currently running CryoSPARC 4.5.3. This appears to be a pre-compiled library that ships in the worker tar file, so my guess is that it was compiled with CPU instructions that are not supported by my AMD EPYC 74F3 24-core processors.
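
As a sanity check on that theory, something like the following could be run on the compute nodes to confirm which instruction-set extensions the CPUs actually report. This is just an illustrative sketch using GCC's __builtin_cpu_supports, nothing CryoSPARC-specific:

/* cpu_features.c - print which common x86 extensions this CPU reports.
 * Illustrative only; build with: gcc -o cpu_features cpu_features.c */
#include <stdio.h>

int main(void)
{
    __builtin_cpu_init();  /* initialize CPU feature detection (GCC/Clang builtin) */
    printf("sse4.2 : %s\n", __builtin_cpu_supports("sse4.2")  ? "yes" : "no");
    printf("avx    : %s\n", __builtin_cpu_supports("avx")     ? "yes" : "no");
    printf("avx2   : %s\n", __builtin_cpu_supports("avx2")    ? "yes" : "no");
    printf("avx512f: %s\n", __builtin_cpu_supports("avx512f") ? "yes" : "no");
    return 0;
}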

Any help tracking this down or mitigating the issue would be appreciated.

Hi @karcaw,

Thanks for bringing this up on the forum. We don’t enable any instruction set extensions that wouldn’t be supported by your 74F3, but we do intentionally use invalid opcodes inside some of our assertions. So, each of these system log entries is probably associated with the abnormal termination of a specific CryoSPARC job. The job logs will probably have more information about what went wrong. If you look for job failures corresponding to those time stamps, and find anything that you don’t understand the cause of, feel free to post again with the relevant information from the job log and we’ll try to help you sort out what’s causing the jobs to fail.
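
For illustration only (this is not CryoSPARC's actual code), here is the general pattern: a compiled assertion can deliberately execute an undefined instruction, so a failed check kills the process with SIGILL and the kernel logs a "trap invalid opcode" line just like the ones above.

/* Illustrative sketch of an assertion that traps via an invalid opcode.
 * __builtin_trap() emits the x86 `ud2` instruction, so a failing check
 * shows up in the kernel log as "trap invalid opcode" instead of a
 * normal abort(). Not CryoSPARC code. */
#include <stdio.h>

#define MY_ASSERT(cond)                                         \
    do {                                                        \
        if (!(cond)) {                                          \
            fprintf(stderr, "assertion failed: %s\n", #cond);   \
            __builtin_trap(); /* raises SIGILL */               \
        }                                                       \
    } while (0)

int main(void)
{
    MY_ASSERT(1 + 1 == 3); /* fails on purpose to demonstrate the trap */
    return 0;
}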

– Harris

found this in one log:

platform_commit_mem: mprotect: Cannot allocate memory
assertion failed: platform_commit_mem(a->mem + a->committed, needed_amount)
/home/svc-pncc/cryosparc2_worker/bin/cryosparcw: line 150: 200269 Illegal instruction     python -c "import cryosparc_compute.run as run; run.run()" "$@"
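
For context, and purely as a guess at the mechanism rather than CryoSPARC's actual implementation: the message reads like an allocator that reserves a large address range up front and commits pages on demand with mprotect. If the process is under a memory limit (or strict overcommit accounting), that commit step can fail with ENOMEM, and the failed assertion then raises the invalid-opcode trap seen in the kernel log. A rough sketch, with names modelled on the log line above:

/* Hypothetical reserve/commit arena - names modelled on the log line above,
 * not real CryoSPARC code. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

struct arena { unsigned char *mem; size_t reserved; size_t committed; };

/* Commit (make readable/writable) another chunk of the reserved range. */
static int platform_commit_mem(void *addr, size_t len)
{
    if (mprotect(addr, len, PROT_READ | PROT_WRITE) != 0) {
        perror("platform_commit_mem: mprotect"); /* e.g. "Cannot allocate memory" */
        return 0;
    }
    return 1;
}

int main(void)
{
    struct arena a = { NULL, 1UL << 30, 0 };      /* reserve 1 GiB of address space */
    a.mem = mmap(NULL, a.reserved, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (a.mem == MAP_FAILED) { perror("mmap"); return 1; }

    size_t needed_amount = 64UL << 20;            /* commit 64 MiB on demand */
    if (!platform_commit_mem(a.mem + a.committed, needed_amount))
        return 1;                                 /* real code asserts here and traps */
    a.committed += needed_amount;

    memset(a.mem, 0, a.committed);                /* committed pages are now usable */
    munmap(a.mem, a.reserved);
    return 0;
}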

Looks like you’re running out of memory. How much RAM does your machine have, and what kind of job were you running?

The nodes that hit the cryosparc_io.so trap have 1 TiB of system memory. Each node has 8 NVIDIA RTX A5000 GPUs with 24 GiB of memory each.

The user is running a Reference Based Motion Correction job.

Here is the memory graph of one of these jobs.

This error occurred during the compute empirical dose weights step. When I try on a less powerful node, starting an RBMC job with manual hyperparameters specified, I just get the bottom "…Illegal instruction…" line without the preceding memory error, before the job has even processed a single movie.

3.43 million particles, 500 pix box size

I’ve tried many combinations: different numbers of GPUs, different memory cache sizes, and turning off “slicing gpu also computes trajectories”.

Interesting, thanks for providing additional details. Are you using a cluster management system (e.g. SLURM), containerization, or any other system that might be limiting an individual process to a smaller amount of memory than what the compute node actually possesses?
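
One way to check that from inside a job is to print the process's address-space rlimit and its cgroup memory limit; SLURM normally enforces --mem through cgroups rather than rlimits. The sketch below is illustrative only and assumes cgroup v2 mounted at /sys/fs/cgroup, which may not match your cluster:

/* limits_check.c - print RLIMIT_AS and the cgroup v2 memory limit of this process.
 * Illustrative sketch only; assumes cgroup v2 mounted at /sys/fs/cgroup. */
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_AS, &rl) == 0) {
        if (rl.rlim_cur == RLIM_INFINITY)
            printf("RLIMIT_AS: unlimited\n");
        else
            printf("RLIMIT_AS: %llu bytes\n", (unsigned long long)rl.rlim_cur);
    }

    /* Find this process's cgroup ("0::<path>") and read its memory.max;
     * "max" means unlimited, a number is the hard cap in bytes. */
    char line[256] = {0}, path[512];
    FILE *cg = fopen("/proc/self/cgroup", "r");
    if (cg && fgets(line, sizeof line, cg)) {
        line[strcspn(line, "\n")] = '\0';
        const char *rel = strchr(line, ':');
        rel = rel ? strchr(rel + 1, ':') : NULL;   /* skip the "0::" prefix */
        if (rel) {
            snprintf(path, sizeof path, "/sys/fs/cgroup%s/memory.max", rel + 1);
            FILE *f = fopen(path, "r");
            char buf[64] = {0};
            if (f && fgets(buf, sizeof buf, f))
                printf("cgroup memory.max: %s", buf);
            if (f) fclose(f);
        }
    }
    if (cg) fclose(cg);
    return 0;
}

If either of those reports a cap well below the node's 1 TiB, that could explain the ENOMEM from mprotect.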

The cluster submission script specifies this:

###SBATCH --mem=16000MB 

I’m unsure whether this is read by SLURM or is considered commented out.

I believe that is commented out, as it never worked quite correctly. We do have SLURM controlling these jobs. My SLURM admin tells me that there are no memory limitations on the jobs.

The output file specifies that RAM = [0], [1] is allocated for the job.

Other job types have

###SBATCH --mem=24000MB

and are allocated RAM = [0], [1], [2] by SLURM, so that would suggest each unit is 8 GB and might be causing throttling.

EDIT: this value seems to change depending on the number of GPUs that are requested for the job. 2 GPUs give [0], [1], [2], [3] and 8 GPUs give [0], …, [24]

Which job was this? I can check how much RAM the output file says was allocated.

J267 is the job I looked at, I believe.


That was allocated:

RAM   :  [0, 1, 2, 3]

but I have no way of knowing what those units represent.

Thanks @karcaw and @rabdella. A follow-up question: what is your “In-memory cache size” parameter set to in your RBMC job?

I’ve tried the default value of 80 and changed it to 10, 700, 0.1, and 0.9; none of those allow the job to finish. The value I specify affects how much RAM I am allocated for the job out of the 1 TB available. What is the difference between what that in-memory cache is doing and the amount of RAM otherwise needed for the job?

@hsnyder, following up on this: we think that SLURM is handing all of the RAM over to the job, but do these lines in the log file mean that the worker is only attempting to use a fraction of it?

#SBATCH --gres=gpu:2
###SBATCH --mem=32000MB  
Resources allocated:
GPU   :  [0, 1]
RAM   :  [0, 1, 2, 3]
#SBATCH --gres=gpu:4
###SBATCH --mem=64000MB
Resources allocated:
GPU   :  [0, 1, 2, 3]
RAM   :  [0, 1, 2, 3, 4, 5, 6, 7]
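
For reference, assuming the 8 GB-per-slot reading suggested above is right, the numbers are at least self-consistent: 4 RAM slots with 2 GPUs works out to about 32 GB, matching the commented-out --mem=32000MB line, and 8 slots with 4 GPUs to about 64 GB, matching --mem=64000MB, so the RAM figure in the log tracks the number of GPUs requested rather than the node's full 1 TB.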