Add Gemma3n multimodal support with MobileNetV5 vision encoder #18256
Conversation
…ert_hf_to_gguf.py. Add gemma3n to vision projectors in gguf-py/gguf/constants.py.
Before I can do my review, can you explicitly disclose any AI usage in this PR? This is a requirement in our contribution guide: https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md

@ngxson Is this worth it, re your stance here? #17961 (comment)
@CISC I stopped working on the other PR because:
however, if the current PR doesn't add much complexity (or the most complex parts are isolated in their own space), which seems to be the case here, it's probably worth reviewing/merging to unblock use cases while waiting for Google to release the next vision model. If it's still MobileNet, we will optimize the implementation; otherwise we leave it as-is. So my conclusion: it's still worth reviewing this PR, but it doesn't need to be too optimized.
Hi @ngxson, I have updated the PR with an AI disclosure.
2. Use available tensor mapping logic
3. Remove redundant chat template replacement of soft tokens placeholder with media placeholder

…struct and definitions to mobilenetv5.cpp
2. Remove unused `clip_is_gemma3n` func declarations and definitions
3. Remove redundant `rescale_image_u8_to_f32` func and use `normalize_image_u8_to_f32` with zero mean and unit std
4. Calculate n_patches using image_size / patch_size
I've addressed all comments in the latest push and replied briefly inline with commit references. Requesting re-review when you have time.
```python
if name.startswith("vision_tower."):
    tensor_suffix = name[13:]
    return [(f"v.enc.{tensor_suffix}", data_torch)]
```
this code is now worse.
we require any tensors outside timm_model.blocks to be properly mapped using tensor_mapping.py and be processed through self.map_tensor_name
for double-indexed naming like model.vision_tower.timm_model.blocks.1.0.dw_mid.bn.weight, please explicitly state which tensors are being processed via a custom mapping in this class, Gemma3nVisionModel.tensor_mapping = {...}
```python
tensor_mapping = {
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_start.bn.weight": "v.blk.{bid}.{sid}.dw_start.bn.weight",
    ...
}

if name.startswith("vision_tower.timm_model.blocks."):
    # mapping double-block naming
    return [(self.custom_map(name), data_torch)]
else:
    return [(self.map_tensor_name(name), data_torch)]
```
The point is: any mapping should be explicit. The conversion script MUST throw an error on unknown tensors.
Okay, I get the point. I will add a custom map (like the original commit). I will try to make use of the original mapping logic if I see it will be cleaner, but I'll make sure to route any uncaught case through map_tensor_name.
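For illustration, a minimal sketch of that routing under the reviewer's suggested format; the class skeleton, `custom_map`, and the single mapping entry below are assumptions, not code from this PR:

```python
import re

class Gemma3nVisionModel:  # sketch only; the real class derives from the conversion base class
    # hypothetical explicit double-index mapping, one entry per supported tensor
    tensor_mapping = {
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_start.bn.weight":
            "v.blk.{bid}.{sid}.dw_start.bn.weight",
        # ...
    }

    def custom_map(self, name: str) -> str:
        # pull out the two block indices, then look up the template explicitly
        m = re.match(r"model\.vision_tower\.timm_model\.blocks\.(\d+)\.(\d+)\.(.+)", name)
        if m is None:
            raise ValueError(f"cannot map tensor: {name}")
        bid, sid, suffix = m.groups()
        template = f"model.vision_tower.timm_model.blocks.{{bid}}.{{sid}}.{suffix}"
        if template not in self.tensor_mapping:
            # unknown tensors must fail the conversion, not pass through silently
            raise ValueError(f"unknown tensor: {name}")
        return self.tensor_mapping[template].format(bid=bid, sid=sid)
```

Raising on any unmatched name keeps the mapping explicit, as requested above.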
```python
# Vision multimodal projector tensors (non-block) for gemma3n
MODEL_TENSOR.V_MM_INP_PROJ: (
    "embedding_projection",  # gemma3n
```
any vision-related tensors must be prefixed with either v. for the encoder or mm. for the projector
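A hypothetical guard expressing that rule (not code from this PR):

```python
def check_vision_tensor_name(new_name: str) -> None:
    # every converted vision tensor must land under the v.* (encoder)
    # or mm.* (projector) namespace
    if not new_name.startswith(("v.", "mm.")):
        raise ValueError(f"unexpected vision tensor name: {new_name}")
```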
```cpp
model.mm_input_proj_w    = get_tensor(TN_MM_INP_PROJ);
model.mm_soft_emb_norm_w = get_tensor(TN_MM_SOFT_EMB_N);
model.mm_0_w = get_tensor("mm.embedding.weight", false);     // Input embedding
model.mm_1_w = get_tensor("mm.hard_emb_norm.weight", false); // Hard embedding norm
```
hard-coding tensor names like this is not allowed. add #define for them
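A minimal sketch of the requested change, following the TN_* macro convention already used for TN_MM_INP_PROJ above; the two macro names are hypothetical:

```cpp
// hypothetical macros following the existing TN_* naming pattern
#define TN_MM_EMBEDDING     "mm.embedding.weight"
#define TN_MM_HARD_EMB_NORM "mm.hard_emb_norm.weight"

model.mm_0_w = get_tensor(TN_MM_EMBEDDING,     false); // Input embedding
model.mm_1_w = get_tensor(TN_MM_HARD_EMB_NORM, false); // Hard embedding norm
```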
If I remember correctly, these two are not used at all. They appear in the multimodal projector when token ids are given rather than embeddings from the vision encoder, and I just couldn't put my finger on why that would ever be the case.
```python
class Gemma3nMultimodalEmbedder(nn.Module):
    """Embeds token ids or soft tokens for multimodal content into language model space."""
    ...
    def forward(
        ...
        if inputs_embeds is not None:
            emb_norm = self.soft_embedding_norm(inputs_embeds)
        else:
            hard_emb = self.embedding(input_ids - self.vocab_offset)
            emb_norm = self.hard_embedding_norm(hard_emb)
```

I will remove the code altogether, as the vision projector seems to work fine without it.
```python
# Rename vision_tower.timm_model to vision_tower for cleaner naming
name = name.replace("vision_tower.timm_model.", "vision_tower.")
```

```python
# Handle normalization layer naming
name = name.replace("hard_embedding_norm", "hard_emb_norm")
name = name.replace("soft_embedding_norm", "soft_emb_norm")
```
this is not recommended, do as little name.replace as possible. the rest must be handled by map_tensor_name or the custom naming described in the other comment
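For instance, a tensor_mapping.py entry could absorb the rename so map_tensor_name resolves it directly; the gemma3n HF source path below is an assumption:

```python
# in gguf-py/gguf/tensor_mapping.py, instead of name.replace in the model class
MODEL_TENSOR.V_MM_SOFT_EMB_NORM: (
    "multi_modal_projector.mm_soft_emb_norm",  # gemma3
    "model.embed_vision.soft_embedding_norm",  # gemma3n (assumed HF path)
),
```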
I have noted all the points; I will get to them once I have some time.
```cpp
cur = ggml_conv_2d_direct(ctx0, model.mobilenet_stem_conv_w, cur, 2, 2, 0, 0, 1, 1); // padding=0
if (model.mobilenet_stem_conv_b) {
    // Bias is [C, 1, 1, 1], need to reshape to [1, 1, C, 1] for broadcasting to [W, H, C, B]
    ggml_tensor * bias = ggml_reshape_4d(ctx0, model.mobilenet_stem_conv_b, 1, 1, cur->ne[2], 1);
```
any reshapes of model weights must be done in the conversion script to avoid problems with quantized data types
```cpp
// Apply Layer Scaling if present
if (block.layer_scale_w) {
    ggml_tensor * scale_w_reshaped = ggml_reshape_4d(ctx0, block.layer_scale_w,
```
here also (and maybe more). please scan everywhere in the file for this pattern, and do it in convert_hf_to_gguf.py instead
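For illustration, a sketch of how such a reshape could move into the conversion script; the name check and the helper are hypothetical, and note that PyTorch dimension order is the reverse of ggml's ne:

```python
import torch

def reshape_stem_bias(name: str, data_torch: torch.Tensor) -> torch.Tensor:
    # hypothetical helper for modify_tensors: give the stem conv bias its
    # broadcast shape at conversion time, so no runtime ggml_reshape_4d on a
    # (possibly quantized) tensor is needed. A torch tensor of shape
    # (1, C, 1, 1) is stored as ggml ne [1, 1, C, 1].
    if name.endswith("conv_stem.conv.bias"):
        return data_torch.reshape(1, -1, 1, 1)
    return data_torch
```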
This PR implements multimodal (image) support for the Gemma3n model, which uses the MobileNetV5 architecture as its vision encoder (instead of the SigLIP encoder used in Gemma3).
Related issues
Partially Addresses #14429
Architecture Implementation
MobileNetV5 Vision Encoder:
Key Changes to existing code
- `convert_hf_to_gguf.py`: Fix chat template in `Gemma3NModel` by replacing `<image_soft_token>` and `<audio_soft_token>` with `<__media__>`, the default marker
- `src/models/gemma3n-iswa.cpp`
- Relevant changes to files under `/tools/mtmd/` to add the MobileNetV5 vision encoder
Testing
Tested using mtmd cli:

```sh
./llama-mtmd-cli \
    -m gemma3n_e2b-it.gguf \
    --mmproj mobilenetv5_e2b-it.gguf \
    --no-mmproj-offload \
    -fa off \
    -p "Describe this image" \
    --image image.png
```

image.png:

Output:
AI Usage Disclosure
Claude Code was used to explore the existing codebase, create boilerplate, draft initial versions of functions and classes, and assist with debugging and testing. Ultimately, the code has undergone heavy manual editing.