
Oct. 20, 2023

Language-specific configurations

Setting the search language

Algolia doesn’t attempt to detect the language of an index automatically. If you want language-based settings like typo tolerance, stop words, and plurals to work correctly, you should tell the engine which language you want these settings to use.

If you don’t, the engine will use the default setting (all languages), which may result in anomalies such as applying French spellings to English words.

You can set the language individually for each setting or globally, with one setting per index.
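As a minimal sketch, the two approaches might translate into settings payloads like the ones below. The parameter names (queryLanguages, indexLanguages, removeStopWords, ignorePlurals) are the ones referenced on this page. The helper build_language_settings and the choice of English (en) are assumptions made for illustration, and the sketch only builds the payload: sending it to an index (through an API client or the dashboard) is out of scope here.

```python
import json


def build_language_settings(lang: str, per_setting: bool = False) -> dict:
    """Build an index-settings payload that ties language features to `lang`.

    Hypothetical helper for illustration: it only returns a dict.
    """
    if per_setting:
        # Option 1: set the language individually on each language-based setting.
        return {
            "removeStopWords": [lang],
            "ignorePlurals": [lang],
        }
    # Option 2: set the language once for the index. Settings that are simply
    # enabled (True) are then expected to use these languages.
    return {
        "queryLanguages": [lang],
        "indexLanguages": [lang],
        "removeStopWords": True,
        "ignorePlurals": True,
    }


global_settings = build_language_settings("en")
per_setting_settings = build_language_settings("en", per_setting=True)
print(json.dumps(global_settings, indent=2))
```

The per-index form is usually easier to keep consistent, since every language-based setting reads from the same list.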

Guide Set an index's query language

Merchandising Playbook Configuring language settings

Removing stop words

To separate a query’s key terms from its common words (such as “the”, “on”, and “it”), you can instruct the engine to ignore these common words. This helps the engine focus on the essentials of what people are looking for: nouns and adjectives.

Algolia references several sources (including Wiktionary and ranks.nl) to create a list of stop words in all supported languages.
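To make the idea concrete, here is a toy sketch of stop-word removal. It isn't Algolia's implementation: the word list STOP_WORDS_EN is a tiny, hypothetical sample (the real lists are far larger), and whitespace tokenization is a simplifying assumption.

```python
# Tiny, hypothetical English stop-word list, for illustration only.
STOP_WORDS_EN = frozenset({"the", "on", "it", "a", "an", "of", "for"})


def remove_stop_words(query: str, stop_words=STOP_WORDS_EN) -> list:
    """Return the query's terms, lowercased, with stop words dropped."""
    return [term for term in query.lower().split() if term not in stop_words]


key_terms = remove_stop_words("The red dress on the table")
print(key_terms)
```

Only the terms that carry meaning for the search remain, which is what the engine then matches against the records.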

Guide Customize stop words

Merchandising Playbook Configuring language settings

API reference removeStopWords

Ignoring plurals (and other alternative forms)

Algolia’s ignorePlurals parameter, if enabled, tells the engine to consider a word’s plural and singular forms as equivalent.

For example, in English, “cars” = “car” and “feet” = “foot”. To ensure completeness and support multiple languages, Algolia uses Wiktionary templates to declare alternative forms of a word. For example, the template {en-noun|s} shows up like this on Wiktionary’s “car” page:

car (plural cars)

With Wiktionary templates, Algolia builds a dictionary of alternative forms. Almost every language has its own template syntax, and many languages have multiple templates.

Wiktionary templates also support other alternative forms:

German declension. A German noun changes form depending on its case, gender, number, and role in a sentence (dative, nominative, accusative, and genitive). German nouns can have numerous endings: -er, -e, -es, -e (for nominative), -en, -e, -es, -e (accusative), -em, -er, -em, -en (dative), -es, -er, -es, -er (genitive).
Dutch diminutive endings. A Dutch noun changes its ending based on whether it’s small, countable, and other such nuances. For example, huisje is a small huis, and colaatje is a glass of cola.
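The dictionary of alternative forms can be pictured as a lookup from each form to a shared base form. The sketch below is illustrative only: ALTERNATIVE_FORMS_EN is a hypothetical two-entry sample, not data taken from Wiktionary, and the function names are invented for this example.

```python
# Hypothetical miniature dictionary: each alternative form maps to one base form.
ALTERNATIVE_FORMS_EN = {"cars": "car", "feet": "foot"}


def normalize(word: str, forms=ALTERNATIVE_FORMS_EN) -> str:
    """Map a word to its base form, or return it lowercased if unknown."""
    lowered = word.lower()
    return forms.get(lowered, lowered)


def same_word(a: str, b: str) -> bool:
    """Treat two words as equivalent when they share a base form."""
    return normalize(a) == normalize(b)


print(same_word("cars", "car"))
print(same_word("feet", "foot"))
```

A dictionary lookup handles irregular forms such as “feet” that a rule like "strip the trailing s" would miss.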

Guide Customize plurals and other declensions

API reference ignorePlurals

Splitting compound words

Compound words refer to noun phrases (or nominal groups) that combine, without spaces, several words to form a single entity or idea.

An example is the German word “Hundehütte” (“dog house”).

The goal of decompounding is to index and search the individual words “Hund” and “Hütte” (“dog” and “house”) separately, thus improving the chance of a match.

For example, suppose a user searches for “Hütte für große Hunde” (“house for big dogs”), but your records only contain the term “Hundehütte”. Without decompounding, Algolia can’t match the query to these records. They can only match if “Hundehütte” is also indexed in its split form.

This setting supports six languages:

Dutch (nl)
German (de)
Finnish (fi)
Danish (da)
Swedish (sv)
Norwegian Bokmål (no).

Compound words are automatically split within:

All queries where queryLanguages contains one of the six supported languages
All attributes configured in decompoundedAttributes.

Splitting compound words doesn’t alter the records sent to Algolia. Compound words aren’t replaced by the segmented version but indexed in two formats: as the full word and as the atoms.
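The following sketch illustrates the "full word plus atoms" idea and a settings payload that might enable it. The compound lookup COMPOUNDS_DE and the attribute name "name" are hypothetical. The real engine's splitting method isn't described on this page, so a hard-coded lookup stands in for it.

```python
# Hypothetical lookup of known German compounds and their atoms.
COMPOUNDS_DE = {"hundehütte": ["hund", "hütte"]}

# Settings payload sketch; "name" is an assumed attribute in your records.
settings = {
    "queryLanguages": ["de"],
    "indexLanguages": ["de"],
    "decompoundedAttributes": {"de": ["name"]},
}


def index_tokens(word: str, decompound: bool = True) -> list:
    """Return the tokens indexed for a word: the full word, then its atoms."""
    lowered = word.lower()
    atoms = COMPOUNDS_DE.get(lowered, []) if decompound else []
    return [lowered] + atoms


def matches(query: str, record_word: str, decompound: bool = True) -> bool:
    """True if any query term equals one of the record's indexed tokens."""
    query_terms = set(query.lower().split())
    return bool(query_terms & set(index_tokens(record_word, decompound)))


print(index_tokens("Hundehütte"))
print(matches("Hütte für große Hunde", "Hundehütte"))
print(matches("Hütte für große Hunde", "Hundehütte", decompound=False))
```

In this toy version the match comes from the atom “hütte”. Matching “Hunde” to “Hund” would additionally need plural handling, which the sketch leaves out.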

API reference queryLanguages

decompoundedAttributes

decompoundQuery

Word segmentation

In some logographic languages, words in queries or sentences aren’t separated by spaces as they are in languages that use the Latin alphabet. The reader distinguishes each word based on context. Since Algolia’s relevance matches words in the query with words in the records, the engine must first identify which characters form a word in a given query.

For example, “長い赤いドレス” in Japanese means “long red dress”. When receiving this query, Algolia segments it into its composing words “長い” (long), “赤い” (red), and “ドレス” (dress). The same segmentation happens on the records, ensuring a great match and relevance for Japanese queries.

Algolia supports segmentation in Chinese (zh) and Korean (only at query time), and in Japanese (ja) at both query and indexing time. To ensure this segmentation applies, set queryLanguages and indexLanguages to the relevant language code.
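One simple way to picture segmentation is greedy longest-match against a vocabulary. This is a swapped-in textbook technique used for illustration, not a description of Algolia's segmenter, and VOCAB_JA is a hypothetical three-word vocabulary covering only the example query.

```python
# Hypothetical vocabulary covering just the example query.
VOCAB_JA = frozenset({"長い", "赤い", "ドレス"})


def segment(text: str, vocab=VOCAB_JA) -> list:
    """Split text into words by greedy longest-match against `vocab`."""
    words = []
    max_len = max(len(word) for word in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in vocab:
                words.append(candidate)
                i += size
                break
        else:
            # No vocabulary word starts here: emit a single character.
            words.append(text[i])
            i += 1
    return words


print(segment("長い赤いドレス"))
```

Because the same function can run on both the query and the record text, both sides end up with comparable word lists.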

Merchandising Playbook Configuring language settings

API reference queryLanguages

indexLanguages

Japanese transliteration and type-ahead

The Japanese language uses three writing systems: Kanji, Hiragana, and Katakana. When typing a query in Japanese, users first type its pronunciation in Hiragana and then convert it to Katakana or Kanji if relevant.

To ensure relevant results as soon as users start typing, not just when the query is complete, Algolia indexes Japanese words in both their original form and in Hiragana.

Transliteration is only available in Japanese (ja). To apply it, set the indexLanguages setting to ja. You can limit transliteration to some attributes or turn it off with the attributesToTransliterate setting.
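The sketch below shows a settings payload that might enable transliteration on one attribute, plus a toy conversion that stores a word in both its original form and Hiragana. The attribute name "name" and both function names are hypothetical. The conversion handles only Katakana, using the fixed Unicode offset between the two blocks. Converting Kanji requires a dictionary of readings, which is out of scope here.

```python
# Settings payload sketch; "name" is an assumed attribute in your records.
settings = {
    "indexLanguages": ["ja"],
    "attributesToTransliterate": ["name"],
}


def katakana_to_hiragana(text: str) -> str:
    """Convert Katakana characters to Hiragana, leaving other characters as-is."""
    converted = []
    for char in text:
        code = ord(char)
        # Katakana U+30A1..U+30F6 sit exactly 0x60 above Hiragana U+3041..U+3096.
        if 0x30A1 <= code <= 0x30F6:
            converted.append(chr(code - 0x60))
        else:
            converted.append(char)
    return "".join(converted)


def indexed_forms(word: str) -> list:
    """Return the original form, plus the Hiragana form when it differs."""
    hiragana = katakana_to_hiragana(word)
    return [word] if hiragana == word else [word, hiragana]


print(indexed_forms("ドレス"))
```

With both forms indexed, a user who has typed only the Hiragana reading so far can already match the Katakana word.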

Multiple conjugations can end up with the same transliteration.

You can use this feature with Query Suggestions to ensure Japanese users start seeing suggestions from the first keystrokes.

API reference queryLanguages

indexLanguages

attributesToTransliterate
