Add Gemma3n multimodal support with MobileNetV5 vision encoder #18256
Conversation
…ert_hf_to_gguf.py. Add gemma3n to vision projectors in gguf-py/gguf/constants.py.
Before I can do my review, can you explicitly disclose any AI usage in this PR? This is a requirement in our contribution guide: https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md

@ngxson Is this worth it, re your stance here? #17961 (comment)
@CISC I stopped working on the other PR because:
however, if the current PR doesn't add much complexity (or the most complex parts are isolated in their own space), which seems to be the case here, it's probably worth reviewing/merging to unblock use cases while waiting for Google to release the next vision model. If it's still MobileNet, we will optimize the implementation; otherwise we leave it as-is. So my conclusion: it's still worth reviewing this PR, but it doesn't need to be too optimized.
Hi @ngxson, I have updated the PR with an AI disclosure.
2. Use available tensor mapping logic
3. Remove redundant chat template replacement of soft tokens placeholder with media placeholder

…struct and definitions to mobilenetv5.cpp
2. Remove unused `clip_is_gemma3n` func declarations and definitions
3. Remove redundant `rescale_image_u8_to_f32` func and use `normalize_image_u8_to_f32` with zero mean and unit std
4. Calculate n_patches using image_size / patch_size
I've addressed all comments in the latest push and replied briefly inline with commit references. Requesting re-review when you have time.
```python
if name.startswith("vision_tower."):
    tensor_suffix = name[13:]
    return [(f"v.enc.{tensor_suffix}", data_torch)]
```
this code is now worse.
we require any tensors outside timm_model.blocks to be properly mapped using tensor_mapping.py and be processed through self.map_tensor_name
for double-indexed naming like model.vision_tower.timm_model.blocks.1.0.dw_mid.bn.weight, please explicitly state which tensors are being processed via a custom mapping in this class, Gemma3nVisionModel.tensor_mapping = {...}
```python
tensor_mapping = {
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_start.bn.weight": "v.blk.{bid}.{sid}.dw_start.bn.weight",
    ...
}

if name.startswith("vision_tower.timm_model.blocks."):
    # mapping double-block naming
    return [(self.custom_map(name), data_torch)]
else:
    return [(self.map_tensor_name(name), data_torch)]
```
The point is: any mapping should be explicit. The conversion script MUST throw an error on unknown tensors.
Okay, I get the point. I will add a custom map (like the original commit). I will try to make use of the original mapping logic if I see it will be cleaner, but I'll make sure to route any uncaught case through map_tensor_name.
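For illustration, a minimal sketch of that routing under the reviewer's suggested format; the class skeleton, `custom_map`, and the single mapping entry below are assumptions, not code from this PR:

```python
import re

class Gemma3nVisionModel:  # sketch only; the real class derives from the conversion base class
    # hypothetical explicit double-index mapping, one entry per supported tensor
    tensor_mapping = {
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_start.bn.weight":
            "v.blk.{bid}.{sid}.dw_start.bn.weight",
        # ...
    }

    def custom_map(self, name: str) -> str:
        # pull out the two block indices, then look up the template explicitly
        m = re.match(r"model\.vision_tower\.timm_model\.blocks\.(\d+)\.(\d+)\.(.+)", name)
        if m is None:
            raise ValueError(f"cannot map tensor: {name}")
        bid, sid, suffix = m.groups()
        template = f"model.vision_tower.timm_model.blocks.{{bid}}.{{sid}}.{suffix}"
        if template not in self.tensor_mapping:
            # unknown tensors must fail the conversion, not pass through silently
            raise ValueError(f"unknown tensor: {name}")
        return self.tensor_mapping[template].format(bid=bid, sid=sid)
```

Raising on any unmatched name keeps the mapping explicit, as requested above.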
```python
# Vision multimodal projector tensors (non-block) for gemma3n
MODEL_TENSOR.V_MM_INP_PROJ: (
    "embedding_projection",  # gemma3n
```
any vision-related tensors must be prefixed with either v. for the encoder or mm. for the projector
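A hypothetical guard expressing that rule (not code from this PR):

```python
def check_vision_tensor_name(new_name: str) -> None:
    # every converted vision tensor must land under the v.* (encoder)
    # or mm.* (projector) namespace
    if not new_name.startswith(("v.", "mm.")):
        raise ValueError(f"unexpected vision tensor name: {new_name}")
```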
```cpp
model.mm_input_proj_w    = get_tensor(TN_MM_INP_PROJ);
model.mm_soft_emb_norm_w = get_tensor(TN_MM_SOFT_EMB_N);
model.mm_0_w = get_tensor("mm.embedding.weight", false);     // Input embedding
model.mm_1_w = get_tensor("mm.hard_emb_norm.weight", false); // Hard embedding norm
```
hard-coding tensor names like this is not allowed. add #define for them
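A minimal sketch of the requested change, following the TN_* macro convention already used for TN_MM_INP_PROJ above; the two macro names are hypothetical:

```cpp
// hypothetical macros following the existing TN_* naming pattern
#define TN_MM_EMBEDDING     "mm.embedding.weight"
#define TN_MM_HARD_EMB_NORM "mm.hard_emb_norm.weight"

model.mm_0_w = get_tensor(TN_MM_EMBEDDING,     false); // Input embedding
model.mm_1_w = get_tensor(TN_MM_HARD_EMB_NORM, false); // Hard embedding norm
```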
If I remember correctly, these two are not used at all. They appear in the multimodal projector when token ids are given rather than embeddings from the vision encoder, and I just couldn't put my finger on why that would ever be the case.
```python
class Gemma3nMultimodalEmbedder(nn.Module):
    """Embeds token ids or soft tokens for multimodal content into language model space."""
    ...
    def forward(
        ...
        if inputs_embeds is not None:
            emb_norm = self.soft_embedding_norm(inputs_embeds)
        else:
            hard_emb = self.embedding(input_ids - self.vocab_offset)
            emb_norm = self.hard_embedding_norm(hard_emb)
```

I will remove the code altogether, as the vision projector seems to work fine without it.
```python
# Rename vision_tower.timm_model to vision_tower for cleaner naming
name = name.replace("vision_tower.timm_model.", "vision_tower.")
```

```python
# Handle normalization layer naming
name = name.replace("hard_embedding_norm", "hard_emb_norm")
name = name.replace("soft_embedding_norm", "soft_emb_norm")
```
this is not recommended, do as little name.replace as possible. the rest must be handled by map_tensor_name or the custom naming described in the other comment
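For instance, a tensor_mapping.py entry could absorb the rename so map_tensor_name resolves it directly; the gemma3n HF source path below is an assumption:

```python
# in gguf-py/gguf/tensor_mapping.py, instead of name.replace in the model class
MODEL_TENSOR.V_MM_SOFT_EMB_NORM: (
    "multi_modal_projector.mm_soft_emb_norm",  # gemma3
    "model.embed_vision.soft_embedding_norm",  # gemma3n (assumed HF path)
),
```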
I have noted all the points; I will get to them once I have some time.
```cpp
cur = ggml_conv_2d_direct(ctx0, model.mobilenet_stem_conv_w, cur, 2, 2, 0, 0, 1, 1); // padding=0
if (model.mobilenet_stem_conv_b) {
    // Bias is [C, 1, 1, 1], need to reshape to [1, 1, C, 1] for broadcasting to [W, H, C, B]
    ggml_tensor * bias = ggml_reshape_4d(ctx0, model.mobilenet_stem_conv_b, 1, 1, cur->ne[2], 1);
```
any reshapes of model weights must be done in the conversion script to avoid problems with quantized data types
```cpp
// Apply Layer Scaling if present
if (block.layer_scale_w) {
    ggml_tensor * scale_w_reshaped = ggml_reshape_4d(ctx0, block.layer_scale_w,
```
here also (and maybe more). please scan everywhere in the file for this pattern, and do it in convert_hf_to_gguf.py instead
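For illustration, a sketch of how such a reshape could move into the conversion script; the name check and the helper are hypothetical, and note that PyTorch dimension order is the reverse of ggml's ne:

```python
import torch

def reshape_stem_bias(name: str, data_torch: torch.Tensor) -> torch.Tensor:
    # hypothetical helper for modify_tensors: give the stem conv bias its
    # broadcast shape at conversion time, so no runtime ggml_reshape_4d on a
    # (possibly quantized) tensor is needed. A torch tensor of shape
    # (1, C, 1, 1) is stored as ggml ne [1, 1, C, 1].
    if name.endswith("conv_stem.conv.bias"):
        return data_torch.reshape(1, -1, 1, 1)
    return data_torch
```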
This PR implements multimodal (image) support for the Gemma3n model, which uses the MobileNetV5 architecture as its vision encoder (instead of the SigLIP encoder used in Gemma3).
Related issues
Partially Addresses #14429
Architecture Implementation
MobileNetV5 Vision Encoder:
Key Changes to existing code
- `convert_hf_to_gguf.py`: Fix chat template in `Gemma3NModel` by replacing `<image_soft_token>` and `<audio_soft_token>` with `<__media__>`, the default marker
- `src/models/gemma3n-iswa.cpp`
- Relevant changes to files under `/tools/mtmd/` to add the MobileNetV5 vision encoder
Testing
Tested using mtmd cli:

```sh
./llama-mtmd-cli \
    -m gemma3n_e2b-it.gguf \
    --mmproj mobilenetv5_e2b-it.gguf \
    --no-mmproj-offload \
    -fa off \
    -p "Describe this image" \
    --image image.png
```

image.png:

Output:
AI Usage Disclosure
Claude Code was used to explore the existing codebase, create boilerplate, draft initial versions of functions and classes, and assist with debugging and testing. Ultimately, the code has undergone heavy manual editing.