Stemming and Lemmatizing inside API.chat

This article explains how the API.chat engine processes incoming messages.

User input usually arrives as sentences with inflected (or sometimes derived) words, even when it comes from buttons. To process the input string better, on top of keyword matching and regex we also apply tokenization, which includes lemmatizing, with stemming as a fallback.

Note that intent-based text processing is a much better option for understanding sentences and for NLU overall, but for simple tasks or button-only bots this method works best.

What is Stemming?

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of reducing them to a common base form correctly most of the time, and often includes the removal of derivational affixes.

A stemmer for English operating on the stem cat should identify such strings as cats, catlike, and catty as based on that stem. A stemming algorithm might also reduce the words fishing, fished, and fisher to the stem fish. The stem need not be a word itself: for example, the Porter algorithm reduces argue, argued, argues, arguing, and argus to the stem argu.
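
To make the idea concrete, here is a minimal sketch of a suffix-stripping stemmer. It is not the Porter algorithm and not the stemmer used inside API.chat; the suffix list, the minimum stem length, and the name naive_stem are all assumptions chosen for illustration.

```python
# Illustrative only: a crude suffix-stripping stemmer.
# The suffix list and minimum stem length are assumptions, not API.chat's rules.
SUFFIXES = ("ing", "ed", "er", "es", "s")


def naive_stem(word: str, min_stem_len: int = 3) -> str:
    """Chop the first matching suffix off the word, if enough of it remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


print([naive_stem(w) for w in ("fishing", "fished", "fisher", "cats")])
# → ['fish', 'fish', 'fish', 'cat']
print(naive_stem("argued"))
# → argu
```

Like the Porter example above, this sketch can return a stem that is not a real word (argu), which is acceptable as long as related forms map to the same stem.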

What is Lemmatizing?

Lemmatisation (or lemmatization) usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

In many languages, words appear in several inflected forms. For example, in English, the verb 'to walk' may appear as 'walk', 'walked', 'walks', or 'walking'. The base form, 'walk', that one might look up in a dictionary, is called the lemma for the word. The association of the base form with a part of speech is often called a lexeme of the word.
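
A dictionary-based lemmatizer with a stemming fallback can be sketched as below. The vocabulary here is a tiny hand-written table and the fallback only strips a plural "s"; both are assumptions for illustration, whereas a real lemmatizer relies on a full vocabulary and morphological analysis. The names LEMMAS, crude_stem, and lemmatize are hypothetical.

```python
# Illustrative only: vocabulary lookup first, crude stemming as a fallback.
# The table below is a hand-written assumption, not a real dictionary.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "walked": "walk", "walks": "walk", "walking": "walk",
}


def crude_stem(word: str) -> str:
    """Fallback: strip a trailing plural 's' (deliberately simplistic)."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def lemmatize(word: str) -> str:
    """Return the dictionary form if known, otherwise fall back to stemming."""
    word = word.lower()
    if word in LEMMAS:
        return LEMMAS[word]
    return crude_stem(word)


print([lemmatize(w) for w in ("Walking", "is", "buttons")])
# → ['walk', 'be', 'button']
```

Note that the lookup handles irregular forms such as is, which no suffix-chopping rule could map to be.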

Why do this?

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

am, are, is => be

car, cars, car's, cars' => car

This normalization is essential for understanding user input properly and selecting the right transition. For example, it lets you write full sentences on buttons and catch them with a single lemma in the input attribute of a transition.

How do we use Lemmatizing?

Imagine a user sends the chatbot the phrase "Send me last week reports". The phrase is split into words, analyzed, and lemmatized to produce a result like this:

"lemma": [ "send", "me", "last", "week", "report" ]

After that, the original list of words is combined with the lemmas and sent to the state machine analyzer to find the best-fitting transition. That is one FSM step, and the transition's response is returned to the user as the answer to the phrase.

<transition input="report" next="Start">This might be a report. It can have text on any language|языке|ভাষা and even emoji 👌.</transition>
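
The matching step can be sketched roughly as follows. This is not the actual API.chat analyzer: the state wrapper element, the second transition, the function name find_transition, and the "first match wins" rule are all assumptions made to keep the example small. Only the transition element with its input and next attributes comes from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical state definition; the <state> wrapper is an assumption.
STATE_XML = """
<state name="Start">
  <transition input="help" next="Help">Here is some help.</transition>
  <transition input="report" next="Start">This might be a report.</transition>
</state>
""".strip()


def find_transition(words, lemmas, state_xml):
    """Return (next_state, response) of the first transition whose input
    matches any original word or lemma, or None if nothing matches."""
    candidates = set(words) | set(lemmas)
    root = ET.fromstring(state_xml)
    for transition in root.iter("transition"):
        if transition.get("input") in candidates:
            return transition.get("next"), transition.text
    return None


words = ["send", "me", "last", "week", "reports"]
lemmas = ["send", "me", "last", "week", "report"]
print(find_transition(words, lemmas, STATE_XML))
# → ('Start', 'This might be a report.')
```

Without the lemmas, the word reports would not match the input report, which is why the original words and the lemmas are checked together.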

Supported lemmatizers

API.chat supports several languages, each with its own lemmatizer or stemmer. You can select the language with the language property in the chatbot create call. This property only affects which lemmatizer is used; you can still write inputs and responses in any language you need. We support the following languages:

  • English: 1
  • French: 2
  • Spanish: 3
  • Persian: 4
  • Hindi: 5 (stemming only, limited capabilities)
  • Russian: 0

Note that you cannot change the language after the chatbot is created.
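
If you build the create call from code, it can help to name the language values instead of passing bare numbers. The sketch below assumes the numbers in the list above are the values of the language property. The payload shape, the name field, and the helper build_create_payload are hypothetical, since this article only names the language property.

```python
from enum import IntEnum


class Language(IntEnum):
    """Language values as listed in this article."""
    RUSSIAN = 0
    ENGLISH = 1
    FRENCH = 2
    SPANISH = 3
    PERSIAN = 4
    HINDI = 5  # stemming only, limited capabilities


def build_create_payload(name: str, language: Language) -> dict:
    # Hypothetical payload: only the "language" property comes from the
    # article; the "name" field is an assumption for illustration.
    return {"name": name, "language": int(language)}


print(build_create_payload("reports-bot", Language.ENGLISH))
# → {'name': 'reports-bot', 'language': 1}
```

Because the language cannot be changed later, choosing it through a named constant makes an accidental wrong value easier to spot before the chatbot is created.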
