Language-specific configurations
When Algolia knows the language of your data and your users, the engine can apply word-based processing techniques, such as:
- Removing common (stop) words like “the” and “a”
- Making singulars and plurals equivalent
- Detecting word roots
- Separating or combining compound words.
Setting the search language
Algolia doesn’t attempt to detect the language of an index automatically. If you want language-based settings like typo tolerance, stop words, and plurals to work correctly, you should tell the engine which language you want these settings to use.
If you don’t, the engine will use the default setting (all languages), which may result in anomalies such as applying French spellings to English words.
You can do this individually for each setting or more globally, with one setting per index.
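The two approaches can be sketched as settings payloads. This is a minimal illustration, not an official client call: the helper `language_settings` is hypothetical, and `removeStopWords` is a parameter name assumed here because this page doesn't name it. The sketch also assumes that `ignorePlurals` and `removeStopWords` accept either a boolean or a list of language codes.

```python
import json


def language_settings(language: str, per_setting: bool = False) -> dict:
    """Build a settings payload pinned to one language (hypothetical helper)."""
    if per_setting:
        # The language is given individually on each setting.
        return {
            "ignorePlurals": [language],
            "removeStopWords": [language],
        }
    # The language is declared once for the index; the boolean settings
    # are then assumed to follow queryLanguages.
    return {
        "queryLanguages": [language],
        "indexLanguages": [language],
        "ignorePlurals": True,
        "removeStopWords": True,
    }


print(json.dumps(language_settings("en"), indent=2))
print(json.dumps(language_settings("fr", per_setting=True), indent=2))
```

Either payload would then be sent through whichever API client or dashboard you use to update index settings.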
Removing stop words
To separate a query’s key terms from its common words (such as “the”, “on”, and “it”), you can instruct the engine to ignore these common words and help the engine focus on the essentials of what people are looking for: nouns and adjectives.
Algolia references several sources (including Wiktionary and ranks.nl) to create a list of stop words in all supported languages.
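As a rough illustration of the idea, the following sketch filters a query against a stop word list. The list here is a tiny hand-picked set, not Algolia's dictionary, and the fallback to the full query when every word is a stop word is a design choice of this sketch rather than documented engine behavior.

```python
# Illustrative only: a small hand-picked list, not Algolia's dictionary.
STOP_WORDS = {"the", "a", "on", "it", "for", "of"}


def key_terms(query: str, stop_words: set[str] = STOP_WORDS) -> list[str]:
    """Return the query's words with common (stop) words removed."""
    words = query.lower().split()
    kept = [word for word in words if word not in stop_words]
    # Sketch-level choice: keep the original words if nothing else remains,
    # so a query made only of stop words is not emptied out.
    return kept or words


print(key_terms("the cat on the mat"))  # ['cat', 'mat']
print(key_terms("the it"))              # ['the', 'it']
```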
Ignoring plurals (and other alternative forms)
Algolia’s ignorePlurals parameter, if enabled, tells the engine to consider a word’s plural and singular forms as equivalent.
For example, in English, “cars” = “car” and “feet” = “foot”.
To ensure completeness and support multiple languages, Algolia uses Wiktionary templates to declare alternative forms of a word. For example, the template {en-noun|s} would show up like this on Wiktionary’s “car” page:
car (plural cars)
With Wiktionary templates, Algolia builds a dictionary of alternative forms. Almost every language has its own template syntax, and many languages have multiple templates.
Wiktionary templates also support other alternative forms:
- German declension. A German noun changes form depending on its case, gender, number, and role in a sentence (dative, nominative, accusative, and genitive). German nouns can have numerous endings: -er, -e, -es, -e (nominative), -en, -e, -es, -e (accusative), -em, -er, -em, -en (dative), -es, -er, -es, -er (genitive).
- Dutch diminutive endings. A Dutch noun changes its ending based on whether it’s small, countable, and other such nuances. For example, huisje is a small huis, and colaatje is a glass of cola.
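A dictionary of alternative forms can be pictured as a lookup from each form to a shared base word. The sketch below is a simplified illustration using only the examples from this page. The names `ALTERNATIVE_FORMS`, `build_lookup`, and `equivalent` are hypothetical and don't reflect how Algolia stores or queries its dictionaries.

```python
# Hypothetical mini-dictionary built from the examples on this page.
ALTERNATIVE_FORMS = {
    "car": ["cars"],
    "foot": ["feet"],
    "huis": ["huisje"],
}


def build_lookup(forms: dict[str, list[str]]) -> dict[str, str]:
    """Map every form (base or alternative) to its base word."""
    lookup = {}
    for base, alternatives in forms.items():
        lookup[base] = base
        for alternative in alternatives:
            lookup[alternative] = base
    return lookup


LOOKUP = build_lookup(ALTERNATIVE_FORMS)


def equivalent(first: str, second: str) -> bool:
    """Treat two words as equal when they share the same base word."""
    first, second = first.lower(), second.lower()
    # Unknown words fall back to themselves.
    return LOOKUP.get(first, first) == LOOKUP.get(second, second)


print(equivalent("cars", "car"))   # True
print(equivalent("feet", "foot"))  # True
print(equivalent("car", "foot"))   # False
```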
Splitting compound words
Compound words refer to noun phrases (or nominal groups) that combine, without spaces, several words to form a single entity or idea.
An example is the German word “Hundehütte” (“dog house”).
The goal of decompounding is to index and search the individual words “Hund” and “Hütte” (“dog” and “house”) separately, thus improving the chance of a match.
For example, suppose a user searches for “Hütte für große Hunde” (“house for big dogs”), but your records only contain the term “Hundehütte”. Without decompounding, Algolia can’t match these records. The query and records can only match if the compound word “Hundehütte” is also indexed in its split form.
This setting supports six languages:
- Dutch (nl)
- German (de)
- Finnish (fi)
- Danish (da)
- Swedish (sv)
- Norwegian Bokmål (no).
Compound words are automatically split within:
- All queries where queryLanguages contains one of the six supported languages
- All attributes configured in decompoundedAttributes.
Splitting compound words doesn’t alter the records sent to Algolia. Compound words aren’t replaced by the segmented version but indexed in two formats: as the full word and as the atoms.
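The effect of indexing a compound word both whole and as atoms can be sketched as follows. This is a toy model under stated assumptions: the decomposition table is hard-coded, `index_terms` and `matches` are hypothetical helpers, and the `name` attribute in the settings is a placeholder. The settings shape (an object keyed by language code) is how `decompoundedAttributes` is assumed to be structured.

```python
# Assumed settings shape; "name" is a placeholder attribute.
DECOMPOUND_SETTINGS = {
    "queryLanguages": ["de"],
    "decompoundedAttributes": {"de": ["name"]},
}

# Hypothetical decomposition table; a real engine relies on a language dictionary.
KNOWN_COMPOUNDS = {"hundehütte": ["hund", "hütte"]}


def index_terms(word: str, decompound: bool = True) -> set[str]:
    """Return the terms indexed for a word: the full word, plus its atoms."""
    lowered = word.lower()
    terms = {lowered}
    if decompound:
        terms.update(KNOWN_COMPOUNDS.get(lowered, []))
    return terms


def matches(query: str, record_word: str, decompound: bool = True) -> bool:
    """True if any query word equals one of the record's indexed terms."""
    terms = index_terms(record_word, decompound)
    return any(word.lower() in terms for word in query.split())


query = "Hütte für große Hunde"
print(matches(query, "Hundehütte", decompound=False))  # False
print(matches(query, "Hundehütte", decompound=True))   # True
```

In this sketch the match comes from “Hütte” alone. Matching “Hunde” against “Hund” would additionally rely on plural handling, which is left out here.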
Word segmentation
In some logographic languages, words in queries or sentences aren’t separated by spaces as in Latin languages. The reader distinguishes each word based on the context. Since Algolia’s relevance matches words in the query with words in the records, it identifies which characters represent a word for a given query.
For example, “長い赤いドレス” in Japanese means “long red dress”. When receiving this query, Algolia segments it into its composing words “長い” (long), “赤い” (red), and “ドレス” (dress). The same segmentation happens on the records, ensuring a great match and relevance for Japanese queries.
Algolia supports segmentation in Chinese (zh) and Korean (only at query time) and in Japanese (ja) (at both query and indexing time). You must set the queryLanguages and indexLanguages to the relevant language code to ensure this segmentation applies.
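To make the idea of segmentation concrete, here is a minimal sketch using greedy longest-match against a small vocabulary. This technique is chosen only for illustration and is not claimed to be the algorithm Algolia uses. The vocabulary and the `segment` function are hypothetical.

```python
def segment(text: str, vocabulary: set[str]) -> list[str]:
    """Split unspaced text into words by greedy longest-match lookup."""
    max_len = max(len(word) for word in vocabulary)
    tokens = []
    position = 0
    while position < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - position), 0, -1):
            candidate = text[position:position + size]
            if candidate in vocabulary:
                tokens.append(candidate)
                position += size
                break
        else:
            # Unknown character: emit it as its own token and move on.
            tokens.append(text[position])
            position += 1
    return tokens


# Hypothetical vocabulary covering the example query only.
VOCABULARY = {"長い", "赤い", "ドレス"}

print(segment("長い赤いドレス", VOCABULARY))  # ['長い', '赤い', 'ドレス']
```

Applying the same function to both queries and record text mirrors the point that segmentation must be consistent on both sides for words to match.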
Japanese transliteration and type-ahead
The Japanese language uses three writing systems: Kanji, Hiragana, and Katakana. When typing a query in Japanese, users first type its pronunciation in Hiragana and then convert it to Katakana or Kanji if relevant.
To ensure relevant results as soon as users start typing, not just when the query is complete, Algolia indexes Japanese words in both their original form and in Hiragana.
Transliteration is only available in Japanese (ja). To apply it, set the indexLanguages setting to ja. You can limit transliteration to some attributes or turn it off with the attributesToTransliterate setting.
Multiple conjugations can end up with the same transliteration.
You can use this feature with Query Suggestions to ensure Japanese users start seeing suggestions from the first keystrokes.
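The benefit of indexing a Hiragana form alongside the original can be sketched with a toy converter. This example only handles Katakana, which sits at a fixed offset of 0x60 above Hiragana in Unicode. Converting Kanji requires a dictionary of readings and is out of scope here. The helper names and the `name` attribute in the settings are hypothetical.

```python
# Assumed settings; "name" is a placeholder attribute.
TRANSLITERATION_SETTINGS = {
    "indexLanguages": ["ja"],
    "attributesToTransliterate": ["name"],
}


def to_hiragana(text: str) -> str:
    """Convert Katakana letters to Hiragana; leave other characters as-is."""
    # Katakana U+30A1..U+30F6 map to Hiragana U+3041..U+3096 (offset 0x60).
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


def indexed_forms(word: str) -> set[str]:
    """Index a word in its original form and in Hiragana."""
    return {word, to_hiragana(word)}


def prefix_match(typed: str, word: str) -> bool:
    """True if what the user has typed so far is a prefix of any indexed form."""
    return any(form.startswith(typed) for form in indexed_forms(word))


print(to_hiragana("ドレス"))           # どれす
print(prefix_match("どれ", "ドレス"))  # True
```

Because the user's partial input “どれ” is still in Hiragana, it matches the Katakana word “ドレス” only through the additional Hiragana form.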