Running 23 Languages in Production: Field Notes from a Multilingual Shopify Chatbot

A multilingual chatbot for Shopify is one that holds a real conversation in the languages your customers actually buy in, end to end, including the script their keyboard prints and the slang their region writes. Most “100+ language” claims describe the API recognition list, not the runtime that keeps a Berlin or Amsterdam customer in their own language all the way to checkout.
The honest measure of multilingual support is not how many language codes a vendor’s translate layer recognizes. It is how many languages a chatbot has been tuned on a real product catalog, with response models that have read the catalog’s product names, the store’s policy phrasing, and the team’s voice in the FAQs. Those are two different counts, and nobody publishes the second one. When I see a chatbot marketed at 100 or 7,000 languages, I assume the first count and discount the headline. I would rather have 23 tested languages than 7,000 untested ones, because shoppers do not type in language codes; they type in dialects, mixed-language sentences, and product names lifted off the catalog. The runtime either handles that gracefully or it does not, and the recognition list has no opinion on which side of the line a vendor falls on.
What follows is the runtime view, not the marketing view. I have run a 23-language chatbot across Shopify storefronts in the Netherlands, Germany, the UK, Italy, and France, and the failure modes I keep seeing are not the ones the SERP discusses: detection edge cases, script handling, fallback chains, and the supply-side question of which languages are worth tuning at all. Those are what these notes cover. For the broader feature comparison across vendors, see the comparison hub.
Updated: May 2026.
What “23 languages” actually means at runtime
So what is the tuning count made of? Each of the 23 languages has had its response model read the catalog’s product names, the store’s policy phrasing, and the team’s voice in the FAQs. That is the slow, expensive number. The 100 or 7,000 cited elsewhere is the cheap one. That is just the codes a translate API can identify. Both numbers are real; only one of them predicts how the runtime will behave when a shopper writes in.
Per CSA Research (2022), 76% of online shoppers prefer their native language, and 40% will not buy from a site that does not serve them in it. The supply-side question, the one nobody answers in the SERP, is which languages a chatbot has actually been tested in on a live storefront.
The 23 cover the languages where install volume on the platform actually concentrates. Across the customer base I work with, US merchants make up 32.2% of installs, the Netherlands 12.7%, the UK 9.7%, and Germany 7.6%, with Italy and France filling out the next tier. There is no commercial argument for pretending to support Bengali if the catalog is a Dutch security-camera store. A 23-language number that maps to where shoppers actually arrive is a more honest commitment than a 100-language number that points at a translation API.
The take: pick the chatbot that publishes its tested language list, not the one that publishes the longest list. For the broader hub of Shopify chatbots, see the comparison hub.
Where language detection actually breaks
Every vendor product page assumes detection works. In production, detection is the thing that breaks first, on three recurring shapes of input.
Short messages. A four-character message (“merci?”) gives a detector almost nothing to work with. Confidence drops below threshold and the runtime has to choose: answer in the detected language and risk being wrong, or fall back and risk a tone mismatch. Short messages are the openers (“hi?”, “hola?”, “info?”) that decide whether the conversation continues, so the failure mode here has a direct revenue cost.
Mixed-language inputs. A Dutch shopper at IPcam-shop, a security-camera retailer in the Netherlands, will type a sentence like “hoe lang duurt levering van die ARGUS PT 4MP?” (“how long does delivery of that ARGUS PT 4MP take?”). The Dutch sentence frame is unambiguous to a human. Most detectors classify the whole message as English because the SKU dominates the character count. The runtime then answers in English to a Dutch shopper who was, three seconds ago, browsing a Dutch storefront.
Script-pair confusion on cognate languages. Spanish, Portuguese, and Italian share enough Latin-script vocabulary that a short message can be classified two or three ways. The disambiguating word usually arrives at character thirty, not character five. I have watched a Portuguese message get answered in Italian because the opener was three Latin words that happen to be the same in both languages.
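The mixed-SKU and short-message failures share one preprocessing fix: mask catalog SKUs before detection so English tokens cannot dominate the score, and refuse to guess when too little text remains. A minimal sketch follows; the stopword scorer is a stand-in for a real detector, and the `CATALOG_SKUS` list is a hypothetical sample of what catalog ingestion would supply.

```python
import re

# Toy stopword scorer for illustration only; a production runtime would
# call a real detector. The point is the preprocessing around it.
STOPWORDS = {
    "nl": {"hoe", "lang", "duurt", "van", "die", "de", "het", "een", "levering"},
    "en": {"how", "long", "does", "the", "a", "of", "delivery", "take"},
}

# Hypothetical SKUs ingested from the store catalog (assumption).
CATALOG_SKUS = {"ARGUS PT 4MP", "ARGUS 3 PRO"}

def mask_skus(message: str) -> str:
    """Remove catalog SKUs so English product names don't skew detection."""
    for sku in CATALOG_SKUS:
        message = re.sub(re.escape(sku), " ", message, flags=re.IGNORECASE)
    return message

def detect(message: str, min_tokens: int = 3):
    """Return a language code, or None when the message is too short to score."""
    tokens = re.findall(r"[a-zà-ÿ]+", mask_skus(message).lower())
    if len(tokens) < min_tokens:
        return None  # below threshold: hand off to the fallback chain
    scores = {lang: sum(t in words for t in tokens) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

With the SKU masked, the Dutch sentence frame wins; the four-character opener returns `None` instead of a wrong guess.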
The diagram below walks the fallback chain a runtime needs once detection drops below threshold. Most vendor pages stop at step one and pretend the rest does not happen.
The take: every multilingual chatbot has a fallback chain whether it documents one or not. Asking the vendor to draw it on a napkin will tell you more about their runtime than a pricing page ever will.
Scripts the headline number does not advertise
Once detection is solved, script handling reveals what the runtime has actually been tested on. The 23-language list breaks into four script families that fail in four different ways.
Latin-script languages are the easy case for detection and the hard case for tone. A French shopper expects warm formal address; the model has to pick “tu” or “vous” without an explicit instruction. An Italian customer at Puffo Sport expects regional dialect to be understood, even when answers come back in standard Italian. Most vendors that pass for “French” or “Italian” are answering in textbook editions of the language.
RTL scripts (Arabic, Hebrew) detect cleanly because the script is unambiguous. The runtime bottleneck is bidirectional rendering when the message mixes scripts. An Arabic customer typing a question with an English product name has a sentence that flows right-to-left around an embedded left-to-right token, and the widget UI is what breaks first, before the response model even runs.
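One way to keep the widget honest on mixed-script messages is to wrap known LTR tokens in Unicode bidi isolate characters (FSI/PDI from UAX #9) before rendering, so the RTL sentence flows around the embedded SKU instead of being torn apart by it. A minimal sketch, assuming the SKU list comes from catalog ingestion:

```python
FSI = "\u2068"  # First Strong Isolate
PDI = "\u2069"  # Pop Directional Isolate

def isolate_ltr_tokens(text: str, tokens) -> str:
    """Wrap known LTR tokens (e.g. English SKUs) in bidi isolates so an
    RTL sentence renders correctly around them."""
    for tok in tokens:
        text = text.replace(tok, f"{FSI}{tok}{PDI}")
    return text

# Arabic question with an embedded English product name (hypothetical SKU).
msg = isolate_ltr_tokens("هل تتوفر كاميرا ARGUS PT 4MP الآن؟", ["ARGUS PT 4MP"])
```

The isolates are invisible characters; the fix costs nothing in the model and everything in whether the widget applies it.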
CJK scripts (Chinese, Japanese, Korean) detect cleanly on character set, then fail on register. Japanese keigo and Korean speech levels are not a translation problem; they are a relationship-modeling problem. A vendor marketing 95 languages is confirming their translate API returns characters that render, not that the model knows how formal to be.
Transliteration is the case nobody on the SERP names. A Russian shopper typing in Latin characters (“zakaz”, “spasibo”) is a real input on any storefront with Russian-speaking customers. Detectors classify these as Polish or Slovenian, and the runtime has no graceful response unless the catalog has been ingested with the same transliterations the customer types.
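The graceful response requires a transliteration table built at ingestion time that can veto a wrong Latin-script classification. A sketch of the override step, with a hypothetical hand-written table standing in for what the real pipeline would generate from the Cyrillic catalog:

```python
# Hypothetical transliteration table (assumption: built at catalog
# ingestion from the Cyrillic originals).
TRANSLIT = {
    "zakaz": "заказ",       # order
    "spasibo": "спасибо",   # thank you
    "dostavka": "доставка", # delivery
}

def reclassify(message: str, detected_lang: str) -> str:
    """Override a wrong Latin-script detection when romanized-Russian
    tokens from the transliteration table appear in the message."""
    tokens = [t.strip("?!.,") for t in message.lower().split()]
    if any(t in TRANSLIT for t in tokens):
        return "ru"  # transliteration evidence beats the script-based guess
    return detected_lang
```

A message like “zakaz 1482, spasibo!” flips from the detector’s “pl” guess to “ru”; a genuinely Polish message passes through untouched.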
A fresh signal this week: a French long-tail query, “meilleur chatbot ia pour boutique shopify en 2026” (“best AI chatbot for a Shopify store in 2026”), is landing forty-five impressions a week on a Shopify chatbot listicle that does not promise French depth. Position 5.5, zero clicks. That is a runtime-behavior gap, not a translation gap. For why catalog search and chat work better when they share an ingestion layer, see the search-and-chat-together argument. For operators arriving from outside Shopify, the add-an-AI-chatbot-anywhere walkthrough covers the install path.
The take: the headline language count is silent on the script-pair the runtime has actually been tested on. Ask the vendor to demo a mixed-script message and watch the widget, not the model.
What changed when we tuned a language properly
The two named installs are the field-notes payoff. Both arrived at the same lesson by different routes.
At Puffo Sport, the test was native-conversation quality. Italian sporting-goods customers expect the bot to know the technical vocabulary of their sport, the regional phrasing of a sizing question, and the warm-Italian register the team uses in policy pages. Tuning meant ingesting the catalog’s product names, the team’s voice in policy pages, and the FAQ archive that documents how the team itself answers sizing questions. After tuning, customers regularly mistook the bot for a human agent. That is the bar I want every multilingual chatbot held to.
At IPcam-shop, the test was mixed Dutch / English SKU traffic. Security-camera customers in the Netherlands type product SKUs in English mid-sentence as a matter of course; the SKUs are how the cameras are named. Tuning meant the runtime could route the Dutch sentence-frame to a Dutch response while preserving the English SKU exactly as the catalog stores it. The detector still classified the message as English most days. The fix was not better detection; it was the response model knowing that “English” on this storefront means “answer in Dutch and quote the SKU verbatim.”
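The IPcam-shop fix amounts to a per-storefront policy layer sitting between detection and response: the detected code is an input, not the answer. A minimal sketch of that override, with a hypothetical policy table (the storefront key and mapping are assumptions, not the production config):

```python
# Hypothetical per-storefront policy (assumption): on this Dutch
# storefront, a message the detector calls "en" is almost always Dutch
# framing around an English SKU, so the response language overrides it.
STORE_POLICY = {
    "ipcam-shop.nl": {"en": "nl"},
}

def response_language(storefront: str, detected: str) -> str:
    """Map a detected language to the language the response should use,
    falling back to the detection when no override exists."""
    return STORE_POLICY.get(storefront, {}).get(detected, detected)
```

On this storefront, “en” routes to a Dutch answer while any other detection, and any other storefront, passes through unchanged; the SKU itself is quoted verbatim from the catalog, never translated.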
The visualization below maps the before/after on a single tuned language. Numbers are directional, drawn from observations at the named installs.
The lesson the two installs converge on: language tuning is not a per-language prompt template. It is a per-catalog ingestion the response model reads as it generates. Catalog-tuned runtimes and translate-API passthrough produce different runtime behavior even when both report the same language code. The same ingestion layer pays off on the discovery surface; the AI Search side of catalog tuning is where it shows up first.
The take: a 23-language number that is honest about catalog tuning is a better commitment than a 100-language number that points at an API recognition list. I will defend that number publicly and let the longer numbers explain themselves.
Frequently asked questions
How many languages does a Shopify chatbot need?
Fifteen to twenty-five native languages cover the markets where conversion math moves for most multilingual stores. The install distribution at the stores I work with concentrates 62% of traffic in four countries (US, Netherlands, UK, Germany). A larger headline number rarely reflects more tested languages; it reflects a longer translate-API list.
What does “language detection” mean for a chatbot?
Language detection is the runtime decision about which language to respond in, based on the customer’s message text. It breaks on short messages (under five words), on mixed-language inputs with English SKUs inside non-English sentences, and on cognate Latin-script languages where the disambiguating word arrives late.
What happens when detection confidence is low?
A well-tuned runtime falls through an ordered chain: prior conversation turn, store-default language, IP-derived locale, then the widget UI’s rendering language as a last resort. Most vendor pages do not document this chain.
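That ordered chain can be sketched in a few lines. This is a minimal illustration of the fallback order described above, not any vendor's implementation; the 0.80 threshold is an assumed value.

```python
def resolve_language(detected, confidence, *, prior_turn=None,
                     store_default=None, ip_locale=None,
                     widget_lang="en", threshold=0.80):
    """Ordered fallback chain for low-confidence detection: prior turn,
    then store default, then IP-derived locale, then widget language."""
    if detected and confidence >= threshold:
        return detected
    for candidate in (prior_turn, store_default, ip_locale, widget_lang):
        if candidate:
            return candidate
    return widget_lang  # absolute last resort
```

A confident detection wins outright; a shaky one walks the chain until something sticks.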
Are translate-API chatbots multilingual?
Technically yes, practically no. A translate API recognizes a language code and returns characters that render. A multilingual chatbot has been tuned on a catalog in that language and produces answers a native speaker would write. Headline language counts almost always describe the first while marketing the second.
If you are running multiple languages on a Shopify catalog
If your catalog reaches shoppers in more than one language and you want the runtime behavior described above, Shoply AI is the chatbot we built for this. The Shopify listing is at apps.shopify.com/shopping-assistant-by-shoplyai, and the live demo runs at demo.shoplyai.ai. For the broader feature comparison see the multilingual listicle, and for the wider Shopify chatbot set the comparison hub. Happy selling.