How to translate any image in your browser (2026 guide)

The browser-translation problem isn’t just manga. It’s anything where the text you want to read lives as pixels, not characters: doujin captions, foreign-language tweets in screenshots, JRPG dialogue, Pixiv tags, light-novel covers, product labels on foreign-language shopping pages. Web browsers can translate paragraph text natively, but they can’t see inside an image. Until recently, you had to work around that yourself.

Why text-in-image is a different problem from text-in-HTML

Google Translate, DeepL, every browser’s built-in translator — they all do the same trick: walk the page’s HTML, find text nodes, send those strings to a backend, swap them inline. None of that works for image content. The text isn’t in the DOM. It’s a pattern of pixels inside a JPEG or PNG.
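
To make that concrete, here is a minimal sketch of the text-node walk using Python’s standard-library HTML parser. It is an illustration only, not how any particular translator is implemented; the class name, the helper function, and the sample page are made up for this example. The point is what comes out: two strings a translator can work with, and one image URL with no characters attached.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects the text nodes a page translator could send to a backend."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.text_nodes = []      # strings found between tags
        self.image_sources = []   # <img> URLs: pixels, not characters
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "img":
            # An <img> contributes a URL to the DOM, never its lettering.
            self.image_sources.append(dict(attrs).get("src", ""))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.text_nodes.append(text)


def collect(html):
    """Parse an HTML string and return the populated collector."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return collector


# Hypothetical page: a heading, a paragraph, and one manga page as an image.
PAGE = """
<article>
  <h1>Chapter 12</h1>
  <p>Posted yesterday.</p>
  <img src="page-01.jpg">
</article>
"""

result = collect(PAGE)
print(result.text_nodes)     # ['Chapter 12', 'Posted yesterday.']
print(result.image_sources)  # ['page-01.jpg']
```

Everything written inside page-01.jpg is invisible to this walk, which is exactly the gap the rest of this guide is about.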

Classic workarounds tried to bridge the gap and mostly failed:

Page-OCR translators ran optical character recognition on the whole image, then dumped a wall of translated text in a sidebar. Wrong reading order, no per-bubble context, ugly.
Browser-bundled OCR translation (a few extensions tried this) was either accurate-but-slow on big images or fast-but-character-set-limited (worked for English logos, choked on dense Japanese kanji).
Tab-out workflows (Google Lens, phone camera apps, desktop OCR utilities) meant you had to leave the page you were on. Friction killed adoption.

The 2026 pivot: image-edit AI

Two model categories changed the game. First, vision-capable LLMs — from Google, OpenAI, Anthropic and others — can see images natively. They don’t just transcribe characters; they understand which text belongs to which element, the relationship between text regions, and what tone the translation should carry. Second, and more important for in-panel translation, Google’s Gemini image-edit family can re-render an image with the translated text drawn back where the original text was — same bubble, same hand, art untouched. The OCR-then-translate-then-overlay pipeline collapses into a single model call.

For image translation specifically, this means:

Reading order is solved without manual annotation. The model picks up that vertical Japanese reads top to bottom in columns ordered right to left, and that horizontal Korean reads left to right.
Per-region context is preserved. A speech bubble is a unit. A sound effect is a unit. A caption is a unit. Each gets translated in its own framing, not concatenated into a paragraph.
Tone is tunable. “Formal samurai dialogue” reads differently from “casual rom-com” and the model can be prompted to match. Old OCR pipelines couldn’t see tone at all.
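
For a sense of what “reading order” used to cost, here is a rough sketch of the kind of heuristic a classic OCR pipeline had to hand-code once it had detected text regions. Everything in it is an assumption for illustration: the Region fields, the kind labels, and the 40-pixel row tolerance are hypothetical, and real page layouts break simple rules like this all the time. A vision model infers the order from the image itself instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    kind: str   # "bubble", "sfx", or "caption" (hypothetical labels)
    x: int      # left edge in pixels
    y: int      # top edge in pixels
    text: str


def reading_order(regions, right_to_left=True, row_tolerance=40):
    """Order regions row by row, then by x within each row.

    Regions whose top edge is within row_tolerance pixels of the first
    region in the current row are treated as belonging to that row.
    """
    rows = []
    for region in sorted(regions, key=lambda r: r.y):
        if rows and abs(region.y - rows[-1][0].y) <= row_tolerance:
            rows[-1].append(region)
        else:
            rows.append([region])

    ordered = []
    for row in rows:
        # Manga rows run right to left; most Western comics run left to right.
        ordered.extend(sorted(row, key=lambda r: r.x, reverse=right_to_left))
    return ordered


# Hypothetical page with two rows of two regions each.
page = [
    Region("bubble", 50, 35, "B"),
    Region("caption", 60, 310, "D"),
    Region("bubble", 400, 20, "A"),
    Region("sfx", 420, 300, "C"),
]

print([r.text for r in reading_order(page)])                       # ['A', 'B', 'C', 'D']
print([r.text for r in reading_order(page, right_to_left=False)])  # ['B', 'A', 'D', 'C']
```

Note that each region stays its own unit all the way through: the function reorders them, it never concatenates them into a paragraph.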

Use cases beyond manga

The same engine that handles a manga panel works on any image with text. The use cases that benefit:

Manga / manhwa / manhua raws. The loud use case. Speech bubbles, narration boxes, sound effects, all rendered into the original layout.
Doujinshi and fan works. Same image-text shape as manga but for content the official translators will never touch.
Pixiv captions and artist tags. Most Pixiv text lives inside posted images. Auto-translating those unlocks an enormous amount of Japanese art context.
Foreign-language tweets / Reddit posts. People screenshot tweets constantly. The text is image content from the browser’s perspective.
JRPG / visual-novel / Asian-game screenshots. Whether you’re looking up a guide for an untranslated game or just want to read the dialogue in a screenshot someone posted, vision-LLM translation handles it cleanly.
Light-novel covers and book spines. Title text on covers is image content. Knowing what you’re looking at is half of curation.
Foreign signs, menus, product labels in screenshots. Travel research, online shopping from foreign stores, recipe sites in other languages.

Getting good results

A few practical notes for any image-translation workflow, regardless of which tool you use:

Pick an image-edit-capable model. For translation that puts English back where the bubble was, rather than just transcribing the text to a sidebar, you need a model that can re-render the image with edits. Google’s Gemini image-edit family is the standout in 2026; most other vision models can describe what they see but can’t re-draw the image with new text. MochiTranslate uses Gemini specifically for this reason.
Give the model context for tone. A samurai-period manga doesn’t sound like a slice-of-life comedy. Most tools let you pick a tone preference.
Translate a sample first if quality matters. Cheaper models are good enough for casual reading; for art books, poetry, or any text where word choice carries weight, compare a sample page and use a premium model for the parts you care about.
Build a glossary for long series. Recurring proper nouns and attack names benefit from a consistent translation across chapters. Some translators support glossaries; for ones that don’t, you can usually paste names into the prompt context.
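
If your tool accepts free-form prompt context, the tone and glossary notes above can be folded into one block of instructions. The sketch below is one hypothetical way to do that; the function name, the wording of the instructions, and the sample glossary entry are all assumptions, not the prompt any particular tool or model expects.

```python
def build_translation_prompt(target_language, tone=None, glossary=None):
    """Assemble plain-text instructions to send alongside an image.

    tone:     optional free-text description, e.g. "casual rom-com"
    glossary: optional dict mapping source-language names to fixed translations
    """
    lines = [
        f"Translate all text in this image into {target_language}.",
        "Keep each speech bubble, sound effect, and caption as its own unit.",
    ]
    if tone:
        lines.append(f"Match this tone: {tone}.")
    if glossary:
        lines.append("Always use these translations for recurring names:")
        # Sort entries so the same glossary always produces the same prompt.
        for source, target in sorted(glossary.items()):
            lines.append(f"- {source} => {target}")
    return "\n".join(lines)


# Hypothetical series glossary with a single recurring name.
prompt = build_translation_prompt(
    "English",
    tone="formal samurai dialogue",
    glossary={"月影": "Tsukikage"},
)
print(prompt)
```

Keeping the glossary in a plain dictionary means you can reuse it chapter after chapter, which is where the consistency payoff comes from.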

Whatever’s in the image, in your language

Image-text translation in the browser finally works the way it should in 2026: no leaving the page, no desktop pipeline, no guessing at the source script. Hover an image, read it in your language.

The tool we built — MochiTranslate — runs on any <img> ≥ 200px on any site. Hover, translate. Whatever language is on the image, English back into the same layout. Same studio as MochiDim, same one-time pricing, no-tracking philosophy.

If this was useful

The extensions we make solve this end-to-end.

One-time payment, lifetime license, no tracking inside the extensions. The studio that wrote this article.
