EMReady with 50-series GPUs

AlexeiG · June 2, 2026, 7:50pm

Has anyone successfully set up the EMReady or EMReady2 sharpening program on a 50-series GPU? Its default CUDA version, 11.8, is incompatible with these GPUs. If so, would you mind sharing all the package versions inside your conda environment (conda list)? Thank you!

rbs_sci · June 3, 2026, 9:12am

Thanks for reminding me - I need to investigate this myself.

EMReady2 seems to have had some updates since I last downloaded it. I guess I’ll have a look at the current version.

Tentatively, I’ve got a conda environment with Python 3.14, a CUDA 13 build of PyTorch and all other (updated) dependencies working, but the update has broken the selective_scan_interface.py code, so it starts running and then falls flat on its face. I’ll update if I make any progress on a fix.

rbs_sci · June 4, 2026, 8:11am

OK, I’ve got it working. I think. I need to test a little more to make sure output is sane across a range of maps. I’ll write up how I got it working and make a pull request on the emready github tomorrow.

stavros · June 4, 2026, 9:38am

I have it working as well (with a working progress bar :P). If you need any help, let me know.

pey123456 · June 4, 2026, 12:24pm

Hi, you can refer to this:

github.com/huang-laboratory/EMReady2

Blackwell Processors

opened 03:20PM - 08 May 26 UTC

stav-ros

To make the software accessible to an Nvidia RTX 5090 GPU (Blackwell, arch 120), I have made the following modification in emready/vendor/bimamba_ssm/ops/selective_scan_interface.py (around line 265):

```
# Final patch: No variables, just the direct requirements for 1.6.1
conv1d_out = torch.empty_like(x)
causal_conv1d_cuda.causal_conv1d_fwd(
    x, conv1d_weight, conv1d_bias, None, None, conv1d_out, None, True
)
```

Also installed:

- mamba-ssm, version 2.3.1
- causal-conv1d, version 1.6.1

And the following dependencies:

```
name: emready2
channels:
  - conda-forge
dependencies:
  - _openmp_mutex=4.5=20_gnu
  - alsa-lib=1.2.15.3=hb03c661_0
  - binutils=2.45.1=default_h4852527_102
  - binutils_impl_linux-64=2.45.1=default_hfdba357_102
  - binutils_linux-64=2.45.1=default_h4852527_102
  - bzip2=1.0.8=hda65f42_9
  - c-compiler=1.11.0=h4d9bdce_0
  - ca-certificates=2026.4.22=hbd8a1cb_0
  - conda-gcc-specs=14.3.0=he8ccf15_18
  - cuda-cccl_linux-64=13.0.85=ha770c72_0
  - cuda-command-line-tools=13.0.3=ha770c72_0
  - cuda-compiler=13.0.3=hbad6d8a_0
  - cuda-crt-dev_linux-64=13.0.88=ha770c72_0
  - cuda-crt-tools=13.0.88=ha770c72_0
  - cuda-ctadvisor=13.0.85=h676940d_0
  - cuda-cudart=13.0.96=hecca717_0
  - cuda-cudart-dev=13.0.96=hecca717_0
  - cuda-cudart-dev_linux-64=13.0.96=h376f20c_0
  - cuda-cudart-static=13.0.96=hecca717_0
  - cuda-cudart-static_linux-64=13.0.96=h376f20c_0
  - cuda-cudart_linux-64=13.0.96=h376f20c_0
  - cuda-cuobjdump=13.0.85=hffce074_0
  - cuda-cupti=13.0.85=h676940d_0
  - cuda-cupti-dev=13.0.85=h676940d_0
  - cuda-cuxxfilt=13.0.85=hffce074_0
  - cuda-driver-dev=13.0.96=hecca717_0
  - cuda-driver-dev_linux-64=13.0.96=h376f20c_0
  - cuda-gdb=13.0.85=hba53cbc_2
  - cuda-libraries=13.0.3=ha770c72_0
  - cuda-libraries-dev=13.0.3=ha770c72_0
  - cuda-nsight=13.0.85=h7938cbb_0
  - cuda-nvcc=13.0.88=hcdd1206_6
  - cuda-nvcc-dev_linux-64=13.0.88=he91c749_0
  - cuda-nvcc-impl=13.0.88=h85509e4_0
  - cuda-nvcc-tools=13.0.88=he02047a_0
  - cuda-nvcc_linux-64=13.0.88=hb2fc203_6
  - cuda-nvdisasm=13.0.85=hffce074_0
  - cuda-nvml-dev=13.0.87=hffce074_0
  - cuda-nvprune=13.0.85=hffce074_0
  - cuda-nvrtc=13.0.88=hecca717_0
  - cuda-nvrtc-dev=13.0.88=hecca717_0
  - cuda-nvtx=13.0.85=hecca717_0
  - cuda-nvvm-dev_linux-64=13.0.88=ha770c72_0
  - cuda-nvvm-impl=13.0.88=h4bc722e_0
  - cuda-nvvm-tools=13.0.88=h4bc722e_0
  - cuda-opencl=13.0.85=hecca717_0
  - cuda-opencl-dev=13.0.85=hecca717_0
  - cuda-profiler-api=13.0.85=h7938cbb_0
  - cuda-sanitizer-api=13.0.85=h10ca0ad_0
  - cuda-tools=13.0.3=ha770c72_0
  - cuda-version=13.0=hc7b4dd1_3
  - cuda-visual-tools=13.0.3=ha770c72_0
  - cxx-compiler=1.11.0=hfcd1e18_0
  - dbus=1.16.2=h24cb091_1
  - font-ttf-dejavu-sans-mono=2.37=hab24e00_0
  - font-ttf-inconsolata=3.000=h77eed37_0
  - font-ttf-source-code-pro=2.038=h77eed37_0
  - font-ttf-ubuntu=0.83=h77eed37_3
  - fontconfig=2.17.1=h27c8c51_0
  - fonts-conda-ecosystem=1=0
  - fonts-conda-forge=1=hc364b38_1
  - gcc=14.3.0=h0dff253_18
  - gcc_impl_linux-64=14.3.0=hbdf3cc3_18
  - gcc_linux-64=14.3.0=h50e9bb6_24
  - gdb=14.2=py310h6006605_0
  - gds-tools=1.15.1.6=hecca717_0
  - gmp=6.3.0=hac33072_2
  - gxx=14.3.0=h76987e4_18
  - gxx_impl_linux-64=14.3.0=h2185e75_18
  - gxx_linux-64=14.3.0=h8a413ad_24
  - icu=78.3=h33c6efd_0
  - kernel-headers_linux-64=6.12.0=he073ed8_6
  - keyutils=1.6.3=hb9d3cd8_0
  - krb5=1.21.3=h659f571_0
  - ld_impl_linux-64=2.45.1=default_hbd61a6d_102
  - libcap=2.77=hd0affe5_1
  - libcublas=13.1.1.3=h676940d_0
  - libcublas-dev=13.1.1.3=h676940d_0
  - libcufft=12.0.0.61=hecca717_0
  - libcufft-dev=12.0.0.61=hecca717_0
  - libcufile=1.15.1.6=hbc026e6_0
  - libcufile-dev=1.15.1.6=hecca717_0
  - libcurand=10.4.0.35=h676940d_1
  - libcurand-dev=10.4.0.35=h676940d_1
  - libcusolver=12.0.4.66=h676940d_1
  - libcusolver-dev=12.0.4.66=h676940d_1
  - libcusparse=12.6.3.3=hecca717_0
  - libcusparse-dev=12.6.3.3=hecca717_0
  - libedit=3.1.20250104=pl5321h7949ede_0
  - libexpat=2.8.0=hecca717_0
  - libffi=3.5.2=h3435931_0
  - libfreetype=2.14.3=ha770c72_0
  - libfreetype6=2.14.3=h73754d4_0
  - libgcc=15.2.0=he0feb66_18
  - libgcc-devel_linux-64=14.3.0=hf649bbc_118
  - libgcc-ng=15.2.0=h69a702a_18
  - libgfortran=15.2.0=h69a702a_18
  - libgfortran-ng=15.2.0=h69a702a_18
  - libgfortran5=15.2.0=h68bc16d_18
  - libglib=2.88.1=h0d30a3d_1
  - libglvnd=1.7.0=ha4b6fd6_2
  - libgomp=15.2.0=he0feb66_18
  - libiconv=1.18=h3b78370_2
  - liblzma=5.8.3=hb03c661_0
  - liblzma-devel=5.8.3=hb03c661_0
  - libnl=3.11.0=hb9d3cd8_0
  - libnpp=13.0.1.2=h676940d_0
  - libnpp-dev=13.0.1.2=h676940d_0
  - libnsl=2.0.1=hb9d3cd8_1
  - libnuma=2.0.18=hb03c661_3
  - libnvfatbin=13.0.85=hecca717_0
  - libnvfatbin-dev=13.0.85=hecca717_0
  - libnvjitlink=13.0.88=hecca717_0
  - libnvjitlink-dev=13.0.88=hecca717_0
  - libnvjpeg=13.0.1.86=hecca717_0
  - libnvjpeg-dev=13.0.1.86=ha770c72_0
  - libnvptxcompiler-dev=13.0.88=ha770c72_0
  - libnvptxcompiler-dev_linux-64=13.0.88=ha770c72_0
  - libopengl=1.7.0=ha4b6fd6_2
  - libpng=1.6.58=h421ea60_0
  - libsanitizer=14.3.0=h8f1669f_18
  - libsqlite=3.53.1=h0c1763c_0
  - libstdcxx=15.2.0=h934c35e_18
  - libstdcxx-devel_linux-64=14.3.0=h9f08a49_118
  - libstdcxx-ng=15.2.0=hdf11a46_18
  - libsystemd0=260.1=h6569c3e_0
  - libudev1=260.1=h6569c3e_0
  - libuuid=2.42=h5347b49_0
  - libxcb=1.17.0=h8a09558_0
  - libxcrypt=4.4.36=hd590300_1
  - libxkbcommon=1.13.1=hca5e8e5_0
  - libxkbfile=1.1.0=h166bdaf_1
  - libxml2=2.15.3=h49c6c72_0
  - libxml2-16=2.15.3=hca6bf5a_0
  - libzlib=1.3.2=h25fd6f3_2
  - mpfr=4.2.2=he0a73b1_0
  - ncurses=6.6=hdb14827_0
  - nsight-compute=2025.3.1.4=h6a507f3_0
  - nspr=4.38=h29cc59b_0
  - nss=3.118=h445c969_0
  - ocl-icd=2.3.3=hb9d3cd8_0
  - opencl-headers=2025.06.13=hecca717_0
  - openssl=3.6.2=h35e630c_0
  - packaging=26.2=pyhc364b38_0
  - pcre2=10.47=haa7fec5_0
  - pip=26.1.1=pyh8b19718_0
  - pthread-stubs=0.4=hb9d3cd8_1002
  - pygments=2.20.0=pyhd8ed1ab_0
  - python=3.10.20=h3c07f61_0_cpython
  - python_abi=3.10=8_cp310
  - rdma-core=62.0=h192683f_0
  - readline=8.3=h853b02a_0
  - six=1.17.0=pyhe01879c_1
  - sysroot_linux-64=2.39=hc4b9eeb_6
  - tk=8.6.13=noxft_h366c992_103
  - tzdata=2025c=hc9c84f9_1
  - wayland=1.25.0=hd6090a7_0
  - wheel=0.47.0=pyhd8ed1ab_0
  - xcb-util=0.4.1=h4f16b4b_2
  - xcb-util-cursor=0.1.6=hb03c661_0
  - xcb-util-image=0.4.0=hb711507_2
  - xcb-util-keysyms=0.4.1=hb711507_0
  - xcb-util-renderutil=0.3.10=hb711507_0
  - xcb-util-wm=0.4.2=hb711507_0
  - xkeyboard-config=2.47=hb03c661_0
  - xorg-libice=1.1.2=hb9d3cd8_0
  - xorg-libsm=1.2.6=he73a12e_0
  - xorg-libx11=1.8.13=he1eb515_0
  - xorg-libxau=1.0.12=hb03c661_1
  - xorg-libxcomposite=0.4.7=hb03c661_0
  - xorg-libxdamage=1.1.6=hb9d3cd8_0
  - xorg-libxdmcp=1.1.5=hb03c661_1
  - xorg-libxext=1.3.7=hb03c661_0
  - xorg-libxfixes=6.0.2=hb03c661_0
  - xorg-libxi=1.8.2=hb9d3cd8_0
  - xorg-libxrandr=1.5.5=hb03c661_0
  - xorg-libxrender=0.9.12=hb9d3cd8_0
  - xorg-libxtst=1.2.5=hb9d3cd8_3
  - xz=5.8.3=ha02ee65_0
  - xz-gpl-tools=5.8.3=ha02ee65_0
  - xz-tools=5.8.3=hb03c661_0
  - zlib=1.3.2=h25fd6f3_2
  - zstd=1.5.7=hb78ec9c_6
  - pip:
      - biopython==1.86
      - causal-conv1d==1.6.1
      - certifi==2026.4.22
      - charset-normalizer==3.4.7
      - cuda-bindings==13.0.3
      - cuda-pathfinder==1.5.4
      - cuda-toolkit==13.0.2
      - einops==0.8.2
      - emready==0.1.0
      - filelock==3.29.0
      - fsspec==2026.4.0
      - hf-xet==1.5.0
      - huggingface-hub==0.36.2
      - idna==3.13
      - jinja2==3.1.6
      - llvmlite==0.46.0
      - mamba-ssm==2.3.1
      - markupsafe==3.0.3
      - monai==1.4.0
      - mpmath==1.3.0
      - mrcfile==1.5.4
      - networkx==3.4.2
      - ninja==1.13.0
      - numba==0.64.0
      - numpy==1.26.4
      - nvidia-cublas==13.1.0.3
      - nvidia-cuda-cupti==13.0.85
      - nvidia-cuda-nvrtc==13.0.88
      - nvidia-cuda-runtime==13.0.96
      - nvidia-cudnn-cu13==9.15.1.9
      - nvidia-cufft==12.0.0.61
      - nvidia-cufile==1.15.1.6
      - nvidia-curand==10.4.0.35
      - nvidia-cusolver==12.0.4.66
      - nvidia-cusparse==12.6.3.3
      - nvidia-cusparselt-cu13==0.8.0
      - nvidia-nccl-cu13==2.28.9
      - nvidia-nvjitlink==13.0.88
      - nvidia-nvshmem-cu13==3.4.5
      - nvidia-nvtx==13.0.85
      - pillow==12.2.0
      - pyyaml==6.0.3
      - regex==2026.4.4
      - requests==2.33.1
      - safetensors==0.7.0
      - scipy==1.13.0
      - setuptools==81.0.0
      - sympy==1.14.0
      - tokenizers==0.22.2
      - torch==2.10.0+cu130
      - torchaudio==2.11.0+cu130
      - torchvision==0.25.0+cu130
      - tqdm==4.67.3
      - transformers==4.57.3
      - triton==3.6.0
      - typing-extensions==4.15.0
      - urllib3==2.7.0
prefix: /home/software/miniforge3/envs/emready2
```

I hope this can be useful to other users or the developers. Thanks

I am using an RTX 5080, and it works.

For me, I upgraded the Nvidia driver and installed the new versions of mamba-ssm, causal-conv1d …
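
To compare an environment against the versions mentioned in this thread, the installed package metadata can be queried directly from Python. This is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for illustration, and the package names are the ones discussed above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the posts in this thread.
PACKAGES = ("torch", "mamba-ssm", "causal-conv1d")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Run it inside the activated conda environment; the output is a shorter alternative to scanning a full conda list.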

rbs_sci · June 5, 2026, 12:46am

Nice to see others jump in.

Here’s a working Python 3.14 environment (as 3.10 will not be supported for much longer):

conda create -n emready2 python=3.14
conda activate emready2
git clone https://github.com/huang-laboratory/EMReady2/
cd EMReady2
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
pip install numpy scipy mrcfile biopython einops monai numba llvmlite transformers packaging wheel
pip install causal-conv1d ## takes a long time to build
pip install mamba-ssm --no-build-isolation ## takes even longer to build; --no-build-isolation is needed so the build can find its dependencies
pip install -e . --no-deps
git apply update.patch ## fixes selective_scan_interface.py crash; update.patch is the patch file below
cd model_weights
wget http://huanglab.phys.hust.edu.cn/EMReady2/model_weights/model_1p0.pt
wget http://huanglab.phys.hust.edu.cn/EMReady2/model_weights/model_0p6.pt
cd ..
emready [input] [output]
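
Before running emready, it may be worth confirming that the installed PyTorch build ships kernels for the card. The following is a minimal sketch, assuming the GPU reports compute capability (12, 0), which is how the issue quoted above describes the RTX 5090 (arch 120). The helper names are hypothetical. It only inspects PyTorch itself; causal-conv1d and mamba-ssm compile their own CUDA extensions and are not covered by this check.

```python
def has_native_kernels(capability, arch_list):
    """True if arch_list has an sm_XY entry matching the (major, minor) capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def check_torch_build():
    """Print whether the installed PyTorch build targets the first visible GPU."""
    import torch  # imported lazily so the helper above works without PyTorch

    if not torch.cuda.is_available():
        print("CUDA is not available to PyTorch")
        return
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
    print("device:", torch.cuda.get_device_name(0), "capability:", capability)
    print("native kernels for this GPU:", has_native_kernels(capability, arch_list))
```

Calling `check_torch_build()` in the activated environment prints the build and device details. A False on the last line suggests the wheel was built for older architectures, as would be expected with a CUDA 11.8 build.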

The patch file:

diff --git a/emready/vendor/bimamba_ssm/ops/selective_scan_interface.py b/emready/vendor/bimamba_ssm/ops/selective_scan_interface.py
index 7a0f895..f72601e 100644
--- a/emready/vendor/bimamba_ssm/ops/selective_scan_interface.py
+++ b/emready/vendor/bimamba_ssm/ops/selective_scan_interface.py
@@ -252,8 +252,8 @@ class MambaInnerFnNoOutProj(torch.autograd.Function):
         conv1d_weight = rearrange(conv1d_weight, "d 1 w -> d w")
         x, z = xz.chunk(2, dim=1)
         conv1d_bias = conv1d_bias.contiguous() if conv1d_bias is not None else None
-        conv1d_out = causal_conv1d_cuda.causal_conv1d_fwd(
-            x, conv1d_weight, conv1d_bias, None, None, None, True
+        conv1d_out = causal_conv1d_fn(
+            x, conv1d_weight, conv1d_bias, activation="silu"
         )
         # We're being very careful here about the layout, to avoid extra transposes.
         # We want delta to have d as the slowest moving dimension
@@ -355,14 +355,14 @@ class MambaInnerFnNoOutProj(torch.autograd.Function):
         if dout.stride(-1) != 1:
             dout = dout.contiguous()
         if ctx.checkpoint_lvl == 1:
-            conv1d_out = causal_conv1d_cuda.causal_conv1d_fwd(
-                x, conv1d_weight, conv1d_bias, None, None, None, True
+            conv1d_out = causal_conv1d_fn(
+                x, conv1d_weight, conv1d_bias, activation="silu"
             )
             delta = rearrange(
                 delta_proj_weight @ x_dbl[:, :delta_rank].t(), "d (b l) -> b d l", l=L
             )
         # The kernel supports passing in a pre-allocated dz (e.g., in case we want to fuse the
-        # backward of selective_scan_cuda with the backward of chunk).
+        # backward of selective_scan_cuda with the backward of chunk.
         dxz = torch.empty_like(xz)  # (batch, dim, seqlen)
         dx, dz = dxz.chunk(2, dim=1)
         # dout_y = rearrange(dout, "b l d -> b d l") # because no arrange at end of forward, so dout shape is b d l
@@ -423,17 +423,29 @@ class MambaInnerFnNoOutProj(torch.autograd.Function):
         )
         # The kernel supports passing in a pre-allocated dx (e.g., in case we want to fuse the
         # backward of conv1d with the backward of chunk).
+        # causal_conv1d_bwd signature for v1.6.x: (x, weight, bias, dout, seq_idx, initial_states, 
+        #                                         dfinal_states, dx, dinitial_states, activation, 
+        #                                         dbias, return_dinitial_states, bool)
+        width = conv1d_weight.shape[-1]
+        initial_states_shape = (x.shape[0], x.shape[1], width - 1)
+        dinitial_states = torch.zeros(initial_states_shape, device=x.device, dtype=x.dtype)
+        if conv1d_bias is not None:
+            dbias = torch.zeros_like(conv1d_bias)
+        else:
+            dbias = None
         dx, dconv1d_weight, dconv1d_bias, *_ = causal_conv1d_cuda.causal_conv1d_bwd(
             x,
             conv1d_weight,
             conv1d_bias,
             dconv1d_out,
             None,
-            None,
-            None,
+            None,  # initial_states (not used in mamba_inner_fn pattern)
+            None,  # dfinal_states
             dx,
-            False,
-            True,
+            dinitial_states,
+            None,  # activation
+            dbias,
+            False,  # return_dinitial_states
         )
         dconv1d_bias = dconv1d_bias if conv1d_bias is not None else None
         dconv1d_weight = rearrange(dconv1d_weight, "d w -> d 1 w")
@@ -481,8 +493,8 @@ class MambaInnerFn(torch.autograd.Function):
         xz: (batch, dim, seqlen)
         """
         assert (
-            causal_conv1d_cuda is not None
-        ), "causal_conv1d_cuda is not available. Please install causal-conv1d."
+            causal_conv1d_fn is not None
+        ), "causal_conv1d_fn is not available. Please install causal-conv1d."
         assert checkpoint_lvl in [0, 1]
         L = xz.shape[-1]
         delta_rank = delta_proj_weight.shape[1]
@@ -503,8 +515,8 @@ class MambaInnerFn(torch.autograd.Function):
         conv1d_weight = rearrange(conv1d_weight, "d 1 w -> d w")
         x, z = xz.chunk(2, dim=1)
         conv1d_bias = conv1d_bias.contiguous() if conv1d_bias is not None else None
-        conv1d_out = causal_conv1d_cuda.causal_conv1d_fwd(
-            x, conv1d_weight, conv1d_bias, None, None, None, True
+        conv1d_out = causal_conv1d_fn(
+            x, conv1d_weight, conv1d_bias, activation="silu"
         )
         # We're being very careful here about the layout, to avoid extra transposes.
         # We want delta to have d as the slowest moving dimension
@@ -586,8 +598,8 @@ class MambaInnerFn(torch.autograd.Function):
     def backward(ctx, dout):
         # dout: (batch, seqlen, dim)
         assert (
-            causal_conv1d_cuda is not None
-        ), "causal_conv1d_cuda is not available. Please install causal-conv1d."
+            causal_conv1d_fn is not None
+        ), "causal_conv1d_fn is not available. Please install causal-conv1d."
         (
             xz,
             conv1d_weight,
@@ -613,14 +625,14 @@ class MambaInnerFn(torch.autograd.Function):
         if dout.stride(-1) != 1:
             dout = dout.contiguous()
         if ctx.checkpoint_lvl == 1:
-            conv1d_out = causal_conv1d_cuda.causal_conv1d_fwd(
-                x, conv1d_weight, conv1d_bias, None, None, None, True
+            conv1d_out = causal_conv1d_fn(
+                x, conv1d_weight, conv1d_bias, activation="silu"
             )
             delta = rearrange(
                 delta_proj_weight @ x_dbl[:, :delta_rank].t(), "d (b l) -> b d l", l=L
             )
         # The kernel supports passing in a pre-allocated dz (e.g., in case we want to fuse the
-        # backward of selective_scan_cuda with the backward of chunk).
+        # backward of selective_scan_cuda with the backward of chunk.
         dxz = torch.empty_like(xz)  # (batch, dim, seqlen)
         dx, dz = dxz.chunk(2, dim=1)
         dout = rearrange(dout, "b l e -> e (b l)")
@@ -686,17 +698,29 @@ class MambaInnerFn(torch.autograd.Function):
         )
         # The kernel supports passing in a pre-allocated dx (e.g., in case we want to fuse the
         # backward of conv1d with the backward of chunk).
+        # causal_conv1d_bwd signature for v1.6.x: (x, weight, bias, dout, seq_idx, initial_states, 
+        #                                         dfinal_states, dx, dinitial_states, activation, 
+        #                                         dbias, return_dinitial_states, bool)
+        width = conv1d_weight.shape[-1]
+        initial_states_shape = (x.shape[0], x.shape[1], width - 1)
+        dinitial_states = torch.zeros(initial_states_shape, device=x.device, dtype=x.dtype)
+        if conv1d_bias is not None:
+            dbias = torch.zeros_like(conv1d_bias)
+        else:
+            dbias = None
         dx, dconv1d_weight, dconv1d_bias, *_ = causal_conv1d_cuda.causal_conv1d_bwd(
             x,
             conv1d_weight,
             conv1d_bias,
             dconv1d_out,
             None,
-            None,
-            None,
+            None,  # initial_states (not used in mamba_inner_fn pattern)
+            None,  # dfinal_states
             dx,
-            False,
-            True,
+            dinitial_states,
+            None,  # activation
+            dbias,
+            False,  # return_dinitial_states
         )
         dconv1d_bias = dconv1d_bias if conv1d_bias is not None else None
         dconv1d_weight = rearrange(dconv1d_weight, "d w -> d 1 w")
@@ -746,6 +770,9 @@ class BiMambaInnerFn(torch.autograd.Function):
         """
         xz: (batch, dim, seqlen)
         """
+        assert (
+            causal_conv1d_fn is not None
+        ), "causal_conv1d_fn is not available. Please install causal-conv1d."
         assert checkpoint_lvl in [0, 1]
         L = xz.shape[-1]
         delta_rank = delta_proj_weight.shape[1]
@@ -766,8 +793,8 @@ class BiMambaInnerFn(torch.autograd.Function):
         conv1d_weight = rearrange(conv1d_weight, "d 1 w -> d w")
         x, z = xz.chunk(2, dim=1)
         conv1d_bias = conv1d_bias.contiguous() if conv1d_bias is not None else None
-        conv1d_out = causal_conv1d_cuda.causal_conv1d_fwd(
-            x, conv1d_weight, conv1d_bias, None, None, None, True
+        conv1d_out = causal_conv1d_fn(
+            x, conv1d_weight, conv1d_bias, activation="silu"
         )
         # We're being very careful here about the layout, to avoid extra transposes.
         # We want delta to have d as the slowest moving dimension
@@ -894,14 +921,14 @@ class BiMambaInnerFn(torch.autograd.Function):
         if dout.stride(-1) != 1:
             dout = dout.contiguous()
         if ctx.checkpoint_lvl == 1:
-            conv1d_out = causal_conv1d_cuda.causal_conv1d_fwd(
-                x, conv1d_weight, conv1d_bias, None, None, None, True
+            conv1d_out = causal_conv1d_fn(
+                x, conv1d_weight, conv1d_bias, activation="silu"
             )
             delta = rearrange(
                 delta_proj_weight @ x_dbl[:, :delta_rank].t(), "d (b l) -> b d l", l=L
             )
         # The kernel supports passing in a pre-allocated dz (e.g., in case we want to fuse the
-        # backward of selective_scan_cuda with the backward of chunk).
+        # backward of selective_scan_cuda with the backward of chunk.
         dxz = torch.empty_like(xz)  # (batch, dim, seqlen)
         dx, dz = dxz.chunk(2, dim=1)
         dout = rearrange(dout, "b l e -> e (b l)")
@@ -1005,17 +1032,29 @@ class BiMambaInnerFn(torch.autograd.Function):
         )
         # The kernel supports passing in a pre-allocated dx (e.g., in case we want to fuse the
         # backward of conv1d with the backward of chunk).
+        # causal_conv1d_bwd signature for v1.6.x: (x, weight, bias, dout, seq_idx, initial_states, 
+        #                                         dfinal_states, dx, dinitial_states, activation, 
+        #                                         dbias, return_dinitial_states, bool)
+        width = conv1d_weight.shape[-1]
+        initial_states_shape = (x.shape[0], x.shape[1], width - 1)
+        dinitial_states = torch.zeros(initial_states_shape, device=x.device, dtype=x.dtype)
+        if conv1d_bias is not None:
+            dbias = torch.zeros_like(conv1d_bias)
+        else:
+            dbias = None
         dx, dconv1d_weight, dconv1d_bias, *_ = causal_conv1d_cuda.causal_conv1d_bwd(
             x,
             conv1d_weight,
             conv1d_bias,
             dconv1d_out,
             None,
-            None,
-            None,
+            None,  # initial_states (not used in mamba_inner_fn pattern)
+            None,  # dfinal_states
             dx,
-            False,
-            True,
+            dinitial_states,
+            None,  # activation
+            dbias,
+            False,  # return_dinitial_states
         )
         dconv1d_bias = dconv1d_bias if conv1d_bias is not None else None
         dconv1d_weight = rearrange(dconv1d_weight, "d w -> d 1 w")
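
For anyone reviewing the patch: the forward-pass hunks replace the low-level causal_conv1d_cuda.causal_conv1d_fwd call with the public causal_conv1d_fn wrapper and activation="silu". As far as the public reference implementation goes, that corresponds to a depthwise causal convolution followed by SiLU. The pure-Python sketch below illustrates the operation on a single channel. It is for intuition only and is not the library's implementation; `causal_conv1d_reference` and `silu` are made-up names.

```python
import math


def silu(v):
    """SiLU activation, v * sigmoid(v), written to avoid overflow for large |v|."""
    if v >= 0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)
    return v * e / (1.0 + e)


def causal_conv1d_reference(x, weight, bias=0.0, activation=None):
    """Causal 1-D convolution of a single channel.

    x:      list of seqlen samples
    weight: list of `width` taps; the last tap multiplies the current sample
    Output t depends only on x[t - width + 1 .. t]; samples before the
    start of the sequence are treated as zero.
    """
    width = len(weight)
    out = []
    for t in range(len(x)):
        acc = bias
        for k in range(width):
            src = t - (width - 1) + k  # index into x for tap k
            if src >= 0:
                acc += weight[k] * x[src]
        out.append(silu(acc) if activation == "silu" else acc)
    return out
```

For example, `causal_conv1d_reference([1, 2, 3], [1, 1])` sums each sample with its predecessor, and changing a later sample never alters an earlier output, which is the causal property the kernel relies on.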

The selective_scan fix was me experimenting with opencode/Qwen3-coder-next, but output maps look almost identical to earlier EMReady2 implementation runs.
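
One way to put a number on "almost identical" is a voxel-wise Pearson correlation between a map from the patched environment and one from an earlier run. This is a minimal sketch, assuming both maps sit on the same grid. The helper names are made up for illustration, any file paths passed in are placeholders, and mrcfile is already among the packages installed above.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    a = [float(v) for v in a]
    b = [float(v) for v in b]
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def map_correlation(path_a, path_b):
    """Correlate the voxel values of two MRC maps on the same grid."""
    import mrcfile  # imported lazily; only needed when reading real maps

    with mrcfile.open(path_a, permissive=True) as m_a, \
            mrcfile.open(path_b, permissive=True) as m_b:
        if m_a.data.shape != m_b.data.shape:
            raise ValueError("maps must have identical dimensions")
        return pearson(m_a.data.ravel(), m_b.data.ravel())
```

The pure-Python loop keeps the sketch dependency-free but will be slow on full-size maps; swapping `pearson` for numpy's `corrcoef` on the flattened arrays would be the practical choice there.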

Conda env:

name: EMReady2
channels:
  - conda-forge
dependencies:
  - _openmp_mutex=4.5=20_gnu
  - bzip2=1.0.8=hda65f42_9
  - ca-certificates=2026.5.20=hbd8a1cb_0
  - ld_impl_linux-64=2.45.1=default_hbd61a6d_102
  - libexpat=2.8.1=hecca717_0
  - libffi=3.5.2=h3435931_0
  - libgcc=15.2.0=he0feb66_19
  - libgomp=15.2.0=he0feb66_19
  - liblzma=5.8.3=hb03c661_0
  - libmpdec=4.0.0=hb03c661_1
  - libsqlite=3.53.1=h0c1763c_0
  - libuuid=2.42.1=h5347b49_0
  - libzlib=1.3.2=h25fd6f3_2
  - ncurses=6.6=hdb14827_0
  - openssl=3.6.2=h35e630c_0
  - pip=26.1.2=pyh145f28c_0
  - python=3.14.5=habeac84_100_cp314
  - python_abi=3.14=8_cp314
  - readline=8.3=h853b02a_0
  - tk=8.6.13=noxft_h366c992_103
  - tzdata=2025c=hc9c84f9_1
  - zstd=1.5.7=hb78ec9c_6
  - pip:
      - annotated-doc==0.0.4
      - anyio==4.13.0
      - apache-tvm-ffi==0.1.9
      - biopython==1.87
      - causal-conv1d==1.6.2.post1
      - certifi==2026.5.20
      - click==8.4.1
      - cloudpickle==3.1.2
      - cuda-bindings==13.3.1
      - cuda-core==1.0.1
      - cuda-pathfinder==1.5.5
      - cuda-python==13.3.1
      - cuda-toolkit==13.0.2
      - einops==0.8.2
      - emready==0.1.0
      - filelock==3.29.0
      - fsspec==2026.4.0
      - h11==0.16.0
      - hf-xet==1.5.0
      - httpcore==1.0.9
      - httpx==0.28.1
      - huggingface-hub==1.17.0
      - idna==3.18
      - jinja2==3.1.6
      - llvmlite==0.47.0
      - mamba-ssm==2.3.2.post1
      - markdown-it-py==4.2.0
      - markupsafe==3.0.3
      - mdurl==0.1.2
      - ml-dtypes==0.5.4
      - monai==1.5.2
      - mpmath==1.3.0
      - mrcfile==1.5.4
      - networkx==3.6.1
      - ninja==1.13.0
      - numba==0.65.1
      - numpy==2.4.4
      - nvidia-cublas==13.1.1.3
      - nvidia-cuda-cupti==13.0.85
      - nvidia-cuda-nvrtc==13.0.88
      - nvidia-cuda-runtime==13.0.96
      - nvidia-cudnn-cu13==9.20.0.48
      - nvidia-cufft==12.0.0.61
      - nvidia-cufile==1.15.1.6
      - nvidia-curand==10.4.0.35
      - nvidia-cusolver==12.0.4.66
      - nvidia-cusparse==12.6.3.3
      - nvidia-cusparselt-cu13==0.8.1
      - nvidia-cutlass-dsl==4.5.2
      - nvidia-cutlass-dsl-libs-base==4.5.2
      - nvidia-nccl-cu13==2.29.7
      - nvidia-nvjitlink==13.0.88
      - nvidia-nvshmem-cu13==3.4.5
      - nvidia-nvtx==13.0.85
      - packaging==26.2
      - pillow==12.2.0
      - psutil==7.2.2
      - pygments==2.20.0
      - pyyaml==6.0.3
      - quack-kernels==0.5.0
      - regex==2026.5.9
      - rich==15.0.0
      - safetensors==0.7.0
      - scipy==1.17.1
      - setuptools==70.2.0
      - shellingham==1.5.4
      - sympy==1.14.0
      - tilelang==0.1.8
      - tokenizers==0.22.2
      - torch==2.12.0+cu130
      - torch-c-dlpack-ext==0.1.5
      - torchaudio==2.11.0+cu130
      - torchvision==0.27.0+cu130
      - tqdm==4.67.3
      - transformers==5.9.0
      - triton==3.7.0
      - typer==0.25.1
      - typing-extensions==4.15.0
      - wheel==0.47.0
      - z3-solver==4.15.4.0
prefix: /home/rbs-sci/.anaconda/EMReady2