The browser-translation problem isn’t just manga. It’s anything where the text you want to read lives as pixels, not characters: doujin captions, foreign-language tweets in screenshots, JRPG dialogue, Pixiv tags, light-novel covers, product labels on foreign-language shopping pages. Web browsers can translate paragraph text natively, but they can’t see inside an image. Until recently, getting at that text was left to you.
Why text-in-image is a different problem from text-in-HTML
Google Translate, DeepL, every browser’s built-in translator — they all do the same trick: walk the page’s HTML, find text nodes, send those strings to a backend, swap them inline. None of that works for image content. The text isn’t in the DOM. It’s a pattern of pixels inside a JPEG or PNG.
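To make the gap concrete, here is a minimal sketch of that text-node walk. The `PageNode` type is a simplified stand-in invented for this example (a real extension would walk the live DOM, for instance with `document.createTreeWalker`), and the sample page is made up.

```typescript
// Simplified stand-in for DOM nodes, assumed for illustration only.
type PageNode =
  | { kind: "text"; value: string }
  | { kind: "element"; tag: string; children: PageNode[] };

// Collect every non-empty text string in document order.
function collectTextStrings(node: PageNode, found: string[] = []): string[] {
  if (node.kind === "text") {
    const trimmed = node.value.trim();
    if (trimmed.length > 0) found.push(trimmed);
  } else {
    for (const child of node.children) collectTextStrings(child, found);
  }
  return found;
}

const samplePage: PageNode = {
  kind: "element",
  tag: "article",
  children: [
    { kind: "element", tag: "p", children: [{ kind: "text", value: "こんにちは" }] },
    // The speech bubbles live inside the image file, so this element has no text children.
    { kind: "element", tag: "img", children: [] },
  ],
};

const translatableStrings = collectTextStrings(samplePage);
console.log(translatableStrings); // ["こんにちは"]
```

The walk finds the paragraph and nothing else. Whatever is drawn inside the image never reaches the translator.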
Classic workarounds tried to bridge the gap and mostly failed:
- Page-OCR translators ran optical character recognition on the whole image, then dumped a wall of translated text in a sidebar. Wrong reading order, no per-bubble context, ugly.
- Browser-bundled OCR translation (a few extensions tried this) was either accurate-but-slow on big images or fast-but-character-set-limited (worked for English logos, choked on dense Japanese kanji).
- Tab-out workflows (Google Lens, phone camera apps, desktop OCR utilities) meant you had to leave the page you were on. Friction killed adoption.
The 2026 pivot: image-edit AI
Two model categories changed the game. First, vision-capable LLMs — from Google, OpenAI, Anthropic and others — can see images natively. They don’t just transcribe characters; they understand which text belongs to which element, the relationship between text regions, and what tone the translation should carry. Second, and more important for in-panel translation, Google’s Gemini image-edit family can re-render an image with the translated text drawn back where the original text was — same bubble, same hand, art untouched. The OCR-then-translate-then-overlay pipeline collapses into a single model call.
For image translation specifically, this means:
- Reading order is handled without manual annotation. The model picks up that vertical Japanese runs top-to-bottom in columns ordered right-to-left, and that horizontal Korean reads left-to-right.
- Per-region context is preserved. A speech bubble is a unit. A sound effect is a unit. A caption is a unit. Each gets translated in its own framing, not concatenated into a paragraph.
- Tone is tunable. “Formal samurai dialogue” reads differently from “casual rom-com” and the model can be prompted to match. Old OCR pipelines couldn’t see tone at all.
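For contrast, here is roughly the kind of hand-tuned ordering heuristic an OCR pipeline needs when nothing infers reading order for it. This is a sketch under assumptions: the `TextRegion` shape, the `orderRegions` name, and the row-bucket size are all hypothetical, and real pages break simple rules like this often.

```typescript
// Hypothetical region shape: top-left corner of a detected text box, in pixels.
interface TextRegion {
  label: string;
  x: number;
  y: number;
}

type ReadingDirection = "rtl" | "ltr";

// Group regions into horizontal bands of `rowHeight` pixels, read bands top to
// bottom, and order regions inside a band by the given direction.
function orderRegions(
  regions: TextRegion[],
  direction: ReadingDirection,
  rowHeight: number
): TextRegion[] {
  const rowOf = (r: TextRegion): number => Math.floor(r.y / rowHeight);
  return regions.slice().sort((a, b) => {
    const rowDiff = rowOf(a) - rowOf(b);
    if (rowDiff !== 0) return rowDiff;
    return direction === "rtl" ? b.x - a.x : a.x - b.x;
  });
}

const panelRegions: TextRegion[] = [
  { label: "left bubble", x: 40, y: 30 },
  { label: "right bubble", x: 400, y: 50 },
  { label: "bottom caption", x: 200, y: 600 },
];

const mangaOrder = orderRegions(panelRegions, "rtl", 300).map((r) => r.label);
const manhwaOrder = orderRegions(panelRegions, "ltr", 300).map((r) => r.label);
console.log(mangaOrder);  // ["right bubble", "left bubble", "bottom caption"]
console.log(manhwaOrder); // ["left bubble", "right bubble", "bottom caption"]
```

The direction and band size have to be supplied by hand for every source. That is the manual annotation a vision model makes unnecessary.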
Use cases beyond manga
The same engine that handles a manga panel works on any image with text. The use cases that benefit:
- Manga / manhwa / manhua raws. The loud use case. Speech bubbles, narration boxes, sound effects, all rendered into the original layout.
- Doujinshi and fan works. Same image-text shape as manga but for content the official translators will never touch.
- Pixiv captions and artist tags. Most Pixiv text lives inside posted images. Auto-translating those unlocks an enormous amount of Japanese art context.
- Foreign-language tweets / Reddit posts. People screenshot tweets constantly. The text is image content from the browser’s perspective.
- JRPG / visual-novel / Asian-game screenshots. Whether you’re looking up a guide for an untranslated game or just want to read the dialogue in a screenshot someone posted, vision-LLM translation handles it cleanly.
- Light-novel covers and book spines. Title text on covers is image content. Knowing what you’re looking at is half of curation.
- Foreign signs, menus, product labels in screenshots. Travel research, online shopping from foreign stores, recipe sites in other languages.
Getting good results
A few practical notes for any image-translation workflow, regardless of which tool you use:
- Pick an image-edit-capable model. For translation that puts English back where the bubble was, rather than just transcribing the text to a sidebar, you need a model that can re-render the image with edits. Google’s Gemini image-edit family is the standout in 2026; most other vision models can describe what they see but can’t re-draw the image with new text. MochiTranslate uses Gemini specifically for this reason.
- Give the model context for tone. A samurai-period manga doesn’t sound like a slice-of-life comedy. Most tools let you pick a tone preference.
- Translate a sample first if quality matters. Cheaper models are good enough for casual reading; for art books, poetry, or any text where word choice carries weight, use a premium model for the parts you care about.
- Build a glossary for long series. Recurring proper nouns and attack names benefit from a consistent translation across chapters. Some translators support glossaries; for ones that don’t, you can usually paste names into the prompt context.
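The tone and glossary tips both come down to what goes into the prompt. A minimal sketch of assembling one is below. The `TranslationRequest` shape, the `buildTranslationPrompt` name, the prompt wording, and the glossary entry are all assumptions made for illustration, not any particular tool’s format.

```typescript
// Hypothetical request shape for an image-translation prompt.
interface TranslationRequest {
  targetLanguage: string;
  tone?: string;
  glossary?: Record<string, string>; // source term -> preferred translation
}

// Build the text instructions that would accompany the image.
function buildTranslationPrompt(req: TranslationRequest): string {
  const lines: string[] = [
    `Translate all text in this image into ${req.targetLanguage}.`,
    "Keep each speech bubble, caption, and sound effect as its own unit.",
  ];
  if (req.tone) {
    lines.push(`Match this tone: ${req.tone}.`);
  }
  const glossary = req.glossary || {};
  const terms = Object.keys(glossary);
  if (terms.length > 0) {
    lines.push("Always use these translations for recurring names:");
    for (const term of terms) {
      lines.push(`- ${term} => ${glossary[term]}`);
    }
  }
  return lines.join("\n");
}

const samplePrompt = buildTranslationPrompt({
  targetLanguage: "English",
  tone: "formal samurai dialogue",
  glossary: { "炎龍拳": "Flame Dragon Fist" }, // made-up attack name
});
console.log(samplePrompt);
```

Keeping the glossary as data rather than hand-edited prompt text makes it easy to reuse the same names across every chapter of a series.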
Whatever’s in the image, in your language
Image-text translation in the browser finally works the way it should in 2026: no leaving the page, no desktop pipeline, no guessing at the source script. Hover an image, read it in your language.
The tool we built — MochiTranslate — runs on any <img> ≥ 200px on any site. Hover, translate. Whatever language is on the image, English back into the same layout. Same studio as MochiDim, same one-time pricing, no-tracking philosophy.