Trap invalid opcode in cryosparc_io.so

Can someone help me understand why cryosparc_io.so is hitting invalid opcodes and killing jobs at times? I have seen 21 instances of this so far this month:

01/all.gz:Oct  1 03:08:44 ga4 kernel: [3596257.176981] traps: python[237937] trap invalid opcode ip:2b47bae0feae sp:2b48d5a4c2c0 error:0 in cryosparc_io.so[2b47bae08000+2d000]
03/all.gz:Oct  3 09:19:49 ga7 kernel: [3791297.061444] traps: python[73443] trap invalid opcode ip:2b49fe804eae sp:2f43972622c0 error:0 in cryosparc_io.so[2b49fe7fd000+2d000]
03/all.gz:Oct  3 09:32:14 ga3 kernel: [3791913.314626] traps: python[211824] trap invalid opcode ip:2b1d1a86deae sp:2b1ea79a82c0 error:0 in cryosparc_io.so[2b1d1a866000+2d000]
03/all.gz:Oct  3 09:48:39 ga7 kernel: [3793026.671240] traps: python[75286] trap invalid opcode ip:2b6815cb55ca sp:2b69175394f0 error:0 in cryosparc_io.so[2b6815cae000+2d000]
03/all.gz:Oct  3 10:48:49 ga7 kernel: [3796636.763563] traps: python[77537] trap invalid opcode ip:2af653ba8eae sp:2af7551832c0 error:0 in cryosparc_io.so[2af653ba1000+2d000]
03/all.gz:Oct  3 12:02:11 ga19 kernel: [699512.990736] traps: python[86418] trap invalid opcode ip:2b8299c87eae sp:2b839b3372c0 error:0 in cryosparc_io.so[2b8299c80000+2d000]
03/all.gz:Oct  3 12:54:39 ga20 kernel: [3804078.859481] traps: python[102163] trap invalid opcode ip:2af4189c0eae sp:2af51be742c0 error:0 in cryosparc_io.so[2af4189b9000+2d000]
03/all.gz:Oct  3 13:47:45 ga17 kernel: [3807020.209252] traps: python[221885] trap invalid opcode ip:2b75a3c31eae sp:2b76a52832c0 error:0 in cryosparc_io.so[2b75a3c2a000+2d000]
03/all.gz:Oct  3 13:49:49 ga7 kernel: [3807497.504271] traps: python[87696] trap invalid opcode ip:2b950a51beae sp:2b960fd082c0 error:0 in cryosparc_io.so[2b950a514000+2d000]
03/all.gz:Oct  3 14:25:40 ga7 kernel: [3809648.975726] traps: python[89880] trap invalid opcode ip:2b9496f18eae sp:2b95aa6832c0 error:0 in cryosparc_io.so[2b9496f11000+2d000]
03/all.gz:Oct  3 14:27:13 ga17 kernel: [3809387.909975] traps: python[224094] trap invalid opcode ip:2b1fcbe99eae sp:2b20cfa8f2c0 error:0 in cryosparc_io.so[2b1fcbe92000+2d000]
03/all.gz:Oct  3 19:29:02 ga20 kernel: [3827741.776435] traps: python[121895] trap invalid opcode ip:2b31364f6eae sp:2b31e23102c0 error:0 in cryosparc_io.so[2b31364ef000+2d000]
03/all.gz:Oct  3 21:07:02 ga17 kernel: [3833376.526132] traps: python[246862] trap invalid opcode ip:2abf7eeb6eae sp:2ac029f7b2c0 error:0 in cryosparc_io.so[2abf7eeaf000+2d000]
04/all.gz:Oct  4 04:47:03 ga13 kernel: [3861134.840269] traps: python[26755] trap invalid opcode ip:2b37bfcf8eae sp:2b3853aa22c0 error:0 in cryosparc_io.so[2b37bfcf1000+2d000]
04/all.gz:Oct  4 04:55:12 ga8 kernel: [3861252.443258] traps: python[112030] trap invalid opcode ip:2b7822a35eae sp:2b7927fc22c0 error:0 in cryosparc_io.so[2b7822a2e000+2d000]
04/all.gz:Oct  4 05:33:20 ga17 kernel: [3863753.473876] traps: python[11436] trap invalid opcode ip:2b565bb0beae sp:2b57f56632c0 error:0 in cryosparc_io.so[2b565bb04000+2d000]
04/all.gz:Oct  4 05:54:45 ga8 kernel: [3864824.464155] traps: python[115475] trap invalid opcode ip:2b084b7a9eae sp:2b094fca12c0 error:0 in cryosparc_io.so[2b084b7a2000+2d000]
04/all.gz:Oct  4 17:06:53 ga11 kernel: [3905613.019307] traps: python[56718] trap invalid opcode ip:2ac88c096eae sp:2ac99f2a32c0 error:0 in cryosparc_io.so[2ac88c08f000+2d000]
05/all.gz:Oct  5 07:53:27 ga17 kernel: [3958556.529031] traps: python[95890] trap invalid opcode ip:2ba820dbdeae sp:2ba91d2142c0 error:0 in cryosparc_io.so[2ba820db6000+2d000]
05/all.gz:Oct  5 11:18:16 ga7 kernel: [3971213.810024] traps: python[235868] trap invalid opcode ip:2af346f7eeae sp:2af4856082c0 error:0 in cryosparc_io.so[2af346f77000+2d000]
06/all.gz:Oct  6 01:49:52 ga8 kernel: [4022916.264838] traps: python[1371] trap invalid opcode ip:2ab2fff435ca sp:2ab52caee8b0 error:0 in cryosparc_io.so[2ab2fff3c000+2d000]

I have also seen similar entries in another log, referencing a different library:

traps: python[19125] trap invalid opcode ip:2ac6e4a92b64 sp:7fff57bee120 error:0 in bin_motion.so[2ac6e4a63000+44000]

I am currently running CryoSPARC 4.5.3. This appears to be a pre-compiled library that ships in the worker tar file, so my guess is that it was compiled with CPU instructions that aren’t supported by my AMD EPYC 74F3 24-core processors.

Any help tracking this down or mitigating the issue would be helpful.

Hi @karcaw,

Thanks for bringing this up on the forum. We don’t enable any instruction set extensions that wouldn’t be supported by your 74F3, but we do intentionally use invalid opcodes inside some of our assertions. Each of these system log entries is therefore probably associated with the abnormal termination of a specific CryoSPARC job, and the job logs will likely have more information about what went wrong. If you look for job failures corresponding to those time stamps and find anything whose cause you don’t understand, feel free to post again with the relevant information from the job log and we’ll try to help you sort out what’s causing the jobs to fail.
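For context, the general technique looks something like the sketch below. This is not our actual source, just an illustration (the MY_ASSERT name is made up): when an assertion fails, it prints a message and then executes an invalid opcode. GCC and Clang’s __builtin_trap() emits the ud2 instruction on x86-64, which is what shows up as “trap invalid opcode” in the kernel log and as an “Illegal instruction” (SIGILL) exit for the process.

    #include <stdio.h>

    /* Illustrative assertion macro (not CryoSPARC's actual code): print a
     * message, then execute an invalid opcode. __builtin_trap() emits ud2 on
     * x86-64, which the kernel reports as "trap invalid opcode" and which
     * kills the process with SIGILL ("Illegal instruction"). */
    #define MY_ASSERT(cond)                                       \
        do {                                                      \
            if (!(cond)) {                                        \
                fprintf(stderr, "assertion failed: %s\n", #cond); \
                __builtin_trap();                                 \
            }                                                     \
        } while (0)

    int main(void) {
        MY_ASSERT(1 + 1 == 3); /* fails, traps, and the process dies with SIGILL */
        return 0;
    }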

– Harris

Found this in one log:

platform_commit_mem: mprotect: Cannot allocate memory
assertion failed: platform_commit_mem(a->mem + a->committed, needed_amount)
/home/svc-pncc/cryosparc2_worker/bin/cryosparcw: line 150: 200269 Illegal instruction     python -c "import cryosparc_compute.run as run; run.run()" "$@"

Looks like you’re running out of memory. How much RAM does your machine have, and what kind of job were you running?

The nodes that hit this trap in cryosparc_io.so have 1 TiB of memory. Each node has 8 NVIDIA RTX A5000 GPUs with 24 GiB of memory each.

The user is running a Reference Based Motion Correction (RBMC) job.

Here is the memory graph of one of these jobs.

This error occurred during the compute empirical dose weights step. When I try on a less powerful node, I just get the final “…Illegal instruction…” line without the memory error; that happens at the very beginning of an RBMC job with manual hyperparameters specified, before it has even processed a single movie.

3.43 million particles, 500 pix box size

I’ve tried many combinations of GPU counts, memory cache sizes, and turning off “slicing gpu also computes trajectories”.

Interesting, thanks for providing additional details. Are you using a cluster management system (e.g. SLURM), containerization, or any other system that might be limiting an individual process to a smaller amount of memory than what the compute node actually possesses?

The cluster submission script specifies this:

###SBATCH --mem=16000MB 

I’m unsure whether this is read by SLURM or treated as commented out.

I believe that is commented out, as it never worked quite correctly. We do have SLURM controlling these jobs. My SLURM admin tells me that there are no memory limitations on the jobs.

The output file specifies that RAM = [0], [1] is allocated for the job.

Other job types have

###SBATCH --mem=24000MB

and are allocated RAM = [0], [1], [2] by SLURM, so that would suggest each unit is 8 GB and might be causing throttling.

EDIT: this value seems to change depending on the number of GPUs that are requested for the job. 2 GPUs give [0], [1], [2], [3] and 8 GPUs give [0], …, [24]

Which job was this? I can check how much RAM the output file says was allocated.

J267 is the job I looked at, I believe.


That was allocated:

RAM   :  [0, 1, 2, 3]

but I have no way of knowing what those units represent.

Thanks @karcaw and @rabdella. A follow-up question: what is your “In-memory cache size” parameter set to in your RBMC job?

I’ve tried the default value of 80 and changed it to 10, 700, 0.1, and 0.9; none of those allow the job to finish. The value I specify affects how much RAM I am allocated for the job out of the 1 TB available. What is the difference between what that in-memory cache is doing and the amount of RAM otherwise needed for the job?

@hsnyder, following up on this. We think that SLURM is handing all of the RAM over to the job, but do these lines in the log file mean that the worker is only attempting to use a fraction of that?

#SBATCH --gres=gpu:2
###SBATCH --mem=32000MB  
Resources allocated:
GPU   :  [0, 1]
RAM   :  [0, 1, 2, 3]
#SBATCH --gres=gpu:4
###SBATCH --mem=64000MB
Resources allocated:
GPU   :  [0, 1, 2, 3]
RAM   :  [0, 1, 2, 3, 4, 5, 6, 7]

Hi @rabdella,

The resource values shown in the log that look like this:

Resources allocated:
GPU   :  [0, 1, 2, 3]
RAM   :  [0, 1, 2, 3, 4, 5, 6, 7]

are for accounting purposes only. They’re essentially a way for CryoSPARC to ensure that a node doesn’t get too heavily oversubscribed when a cluster management system isn’t being used (e.g. on a single workstation installation). CryoSPARC doesn’t (can’t, in fact) actually enforce that a job stays within those limits. If any memory limiting is enforced at all, the enforcement is external to CryoSPARC.

I’m still unsure why your jobs are unable to allocate the memory they’re asking for, especially on such a large node. If you were to upgrade to v4.6, the job log would contain a little bit more information when this happens, which might help me deduce why this is happening. That said, it seems probable to me that the issue lies in some detail of your HPC site configuration. Below is some information about what exactly is happening in the assertion failure case you posted above, which may help your site sysadmin staff to figure out whether there might be any site-specific configuration relevant to this issue.


RBMC reserves a large amount of virtual memory and “commits” it (i.e. allocates it for real) on an as-needed basis. The messages platform_commit_mem: mprotect: Cannot allocate memory and assertion failed: platform_commit_mem(a->mem + a->committed, needed_amount) mean that Linux gave RBMC the virtual address space reservation it asked for, but when it came time to commit some of that memory, Linux refused, saying “Cannot allocate memory”. The Linux documentation for mprotect(), the system call that we use to commit memory, lists several possible causes for this, and unfortunately it’s not possible with the information we have to know which one applies (a minimal sketch of this reserve/commit pattern follows the list below):

  • “Internal kernel structures could not be allocated.” I don’t know enough about the internals of the Linux kernel to know what would cause this, but if it has to do with memory fragmentation or a large number of other processes on the node, then a node reboot or dropping the filesystem caches might help.

  • “Addresses in the range [addr, addr+len-1] are invalid for the address space of the process, or specify one or more pages that are not mapped.” This would be a bug in RBMC, but given how frequently this is successfully tested by us and other users, I don’t think this is the most likely cause.

  • Ubuntu’s documentation adds a third possible cause: “Changing the protection of a memory region would result in the total number of mappings with distinct attributes (e.g., read versus read/write protection) exceeding the allowed maximum.” To me, this sounds like a subset of the first bullet point, but again I’m not familiar enough with the Linux kernel internals to be able to comment in more detail. RBMC doesn’t do enough separate protection changes for me to expect it to hit that kind of limit under normal circumstances.
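To make the reserve/commit mechanism concrete, here is a minimal sketch of the pattern described above (not our actual code; the sizes are made up for illustration). A large region of address space is reserved with mmap() and PROT_NONE, and chunks of it are committed later with mprotect(); the “Cannot allocate memory” message in your log corresponds to that mprotect() call failing with ENOMEM.

    #define _GNU_SOURCE
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t reserve_size = (size_t)64 << 30; /* reserve 64 GiB of address space (illustrative size) */
        size_t commit_size  = (size_t)1 << 30;  /* later commit 1 GiB of it */

        /* Reserve virtual address space only: PROT_NONE plus MAP_NORESERVE
         * means no physical memory or swap is promised yet. */
        void *base = mmap(NULL, reserve_size, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (base == MAP_FAILED) {
            fprintf(stderr, "mmap: %s\n", strerror(errno));
            return 1;
        }

        /* Commit the first chunk by making it readable and writable. This is
         * the step that failed with "Cannot allocate memory" (ENOMEM) in the
         * job log above. */
        if (mprotect(base, commit_size, PROT_READ | PROT_WRITE) != 0) {
            fprintf(stderr, "mprotect: %s\n", strerror(errno));
            return 1;
        }

        printf("reserved %zu bytes, committed %zu bytes\n", reserve_size, commit_size);
        munmap(base, reserve_size);
        return 0;
    }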

The only limit I’m aware of that relates to the number of allowed virtual memory mappings is the maximum map count, which you can check by running cat /proc/sys/vm/max_map_count. But this is usually a large number (65530 by default), and I doubt it would be set low enough to cause a problem for RBMC in practice.
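If you want to rule that limit out, one rough check is to compare the number of mappings a process actually holds (one line per mapping in /proc/<pid>/maps) against the vm.max_map_count value. Below is a small sketch that does this for the current process; for a real check you would substitute the /proc/<pid>/maps file of a running CryoSPARC worker process.

    #include <stdio.h>

    /* Count lines in a file; each line of /proc/<pid>/maps is one mapping. */
    static long count_lines(const char *path) {
        FILE *f = fopen(path, "r");
        if (!f)
            return -1;
        long n = 0;
        for (int c; (c = fgetc(f)) != EOF; ) {
            if (c == '\n')
                n++;
        }
        fclose(f);
        return n;
    }

    int main(void) {
        /* /proc/self/maps is used here for simplicity; point this at the maps
         * file of a running job's process to check a real workload. */
        long maps = count_lines("/proc/self/maps");

        long limit = -1;
        FILE *f = fopen("/proc/sys/vm/max_map_count", "r");
        if (f) {
            if (fscanf(f, "%ld", &limit) != 1)
                limit = -1;
            fclose(f);
        }

        printf("mappings in use: %ld, vm.max_map_count: %ld\n", maps, limit);
        return 0;
    }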

Could you share what distribution of Linux and what version of that distribution you are using? I’d be happy to check the distro-specific documentation as well.

–Harris