Stemming vs Lemmatization
Part 3A: Text Preprocessing for Topic Modelling
I love the word lemmatization because the syllables, consonants, and vowels that make it up form a sound that makes me smile. I like Natural Language Processing (NLP) because I like to think about language in terms of structure, pattern, and rhythm. Through this lens, I can take a sentence and visualize it. Like a fun puzzle, language is messy at first, but a comprehensive picture forms when the right pieces fit together.
I remember sitting in my high school English literature class with multiple sentences written on the board. We broke down each sentence and identified its parts of speech to understand how a word’s structure influences a sentence’s meaning. Breaking down a sentence to understand its structure falls under the discipline of linguistics. While we can break down sentences and identify meaning as individuals, it becomes difficult, and nearly impossible, to do so with large text collections such as 100,000,000 tweets or 100,000 news articles.
I love this figure, taken from the textbook “Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems” by Vajjala, S. et al., 2020. The image shows that context, syntax, morphemes and lexemes, and phonemes are the building blocks of language. A variety of packages, such as NLTK, Spacy, and TextBlob, provide Natural Language Processing tools that work with each of these language blocks. For example, to determine the context (meaning) with NLP, we can apply Topic Modelling and Sentiment Analysis.
Definition of Natural Language Processing
Natural Language Processing is part of the machine learning/AI pipeline: a variety of tasks are applied to process text data and format it so that a computer can read the data and perform analysis.
In this blog post I discuss stemming and lemmatization, two pre-processing methods that prepare text data for machine learning and rule-based algorithms, specifically for topic modelling with Twitter text data. This post is Part 3 in a series on collecting Twitter data about the US Presidential Election.
Why use stemming and lemmatization?
Each document contains a vector of words (terms). Tokenization separates a document into individual terms, and each unique term becomes a column feature in a document-term matrix. For example, suppose a sentence (Document 1) contains the term sit and Document 2 contains the term sitting. The two terms end up as two separate column features even though their meaning is the same. However, if we reduce the word sitting to its root word sit, the document matrix shrinks.
Stemming and lemmatization are two different approaches for stripping down the terms within a document so that the document-term matrix shrinks and the complexity of the data decreases. Reducing the size and complexity of the data helps model accuracy and cuts down on memory use and computation time.
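To make this concrete, here is a minimal sketch (illustrative only, using the NLTK Porter stemmer that is introduced below) of how stemming collapses inflected forms and shrinks the vocabulary of a document-term matrix:
# Illustrative sketch: stemming merges 'sitting'/'sit' and 'cats'/'cat'
# into single features, so the document-term matrix has fewer columns.
from nltk.stem import PorterStemmer
porter = PorterStemmer()
doc1_terms = ["cat", "sitting"]   # terms from Document 1
doc2_terms = ["cats", "sit"]      # terms from Document 2
raw_vocab = set(doc1_terms) | set(doc2_terms)
stemmed_vocab = {porter.stem(term) for term in raw_vocab}
print(sorted(raw_vocab))      # ['cat', 'cats', 'sit', 'sitting'] -> 4 features
print(sorted(stemmed_vocab))  # ['cat', 'sit'] -> 2 features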
Stemming
Stemming is a text-processing method in which a term is reduced to its “stem”, or simplest form, by removing suffixes such as -ED, -ING, -ION, -IONS, and -S. Suffixes are removed specifically for IR performance, not for linguistic meaning (Porter, 1980).
Types of Stemming:
- Porter Stemmer
- Lovins Stemmer
- Paice Stemmer
Stemming removes letters at the end of a word. For example, when stemming is applied to the word catastrophe, the ‘e’ at the end of the word is removed.
catastrophe --> catastroph
Significance of stemming:
Stemming can be beneficial when developing search engines, for query matching, and for feature-space reduction when training a model (Vajjala et al., 2020).
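As a rough sketch of the search-engine use case (a toy example of my own, not from the textbook): if both the query and the document terms are stemmed, a query for “running” can match a document that only contains “runs”:
# Toy keyword search: stem the query and the document terms so that
# different inflections of the same word still match.
from nltk.stem import PorterStemmer
porter = PorterStemmer()
document_terms = ["she", "runs", "every", "morning"]
query = "running"
doc_stems = {porter.stem(term) for term in document_terms}
print(porter.stem(query) in doc_stems)  # True: 'running' and 'runs' both stem to 'run'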
Example rules of Porter Stemmer:
SSES --> SS      (caresses --> caress)
IES  --> I       (ponies --> poni)
SS   --> SS      (caress --> caress)
S    --> (null)  (cats --> cat)
Import and instantiate Porter Stemmer:
# Import NLTK
import nltk
nltk.download('wordnet')   # WordNet data is needed for the lemmatizer used later on
# import the stem package
from nltk import stem
from nltk.corpus import wordnet
# instantiate the Porter stemmer
porter = stem.porter.PorterStemmer()
Let’s see some examples of how Porter Stemmer is applied:
# Create a word list
word_list1 = ['play', 'playing', 'played']
word_list2 = ['feet', 'foot', 'foots', 'footing']
word_list3 = ['organize', 'organizing', 'organization']
word_list4 = ['benefactor', 'benevolent', 'beneficial']
word_list5 = ['universe', 'university']
print("['play', 'playing', 'played'] -------------------->", [porter.stem(word) for word in word_list1])
print("['feet', 'foot', 'foots', 'footing'] -------------> ", [porter.stem(word) for word in word_list2])
print("['organize', 'organizing', 'organization'] -------> ", [porter.stem(word) for word in word_list3])
print("['benefactor', 'benevolent', 'beneficial'] -------> ", [porter.stem(word) for word in word_list4])
print("['universe', 'university'] -------> ", [porter.stem(word) for word in word_list5])
Output:
['play', 'playing', 'played'] --------------------> ['play', 'play', 'play']
['feet', 'foot', 'foots', 'footing'] -------------> ['feet', 'foot', 'foot', 'foot']
['organize', 'organizing', 'organization'] -------> ['organ', 'organ', 'organ']
['benefactor', 'benevolent', 'beneficial'] -------> ['benefactor', 'benevol', 'benefici']
['universe', 'university'] -------> ['univers', 'univers']
Pros for stemming
- Removes suffixes
- Reduces the size and complexity of the data
- Reduces the variance that can cause a model to overfit
Cons for stemming
Stemming can be too aggressive at extracting the root word; one might say that it “butchers” the terms in a document. Stemming will take a word like organize and shorten it to organ, which has an entirely different meaning. The same happens with university, which stems to univers, again reducing the word to one with a different meaning. When deciding whether to use the Porter stemmer, weigh how important the words’ meanings are to your model and analysis against how much you need to reduce the data’s complexity.
Lemmatization
Lemmatization reduces a term to its base word (lemma). This method is a more methodical approach that ensures word reduction does not lose the word’s meaning.
A. NLTK Lemmatization
# import lemmatizer package
from nltk.stem import WordNetLemmatizer
# instantiate the lemmatizer
lem = WordNetLemmatizer()
Examples of how Lemmatization is applied:
print("['play', 'playing', 'played'] -------------------->", [lem.lemmatize(word) for word in word_list1])
print("['feet', 'foot', 'foots', 'footing'] -------------> ", [lem.lemmatize(word) for word in word_list2])
print("['organize', 'organizing', 'organization'] -------> ", [lem.lemmatize(word) for word in word_list3])
print("['benefactor', 'benevolent', 'beneficial'] -------> ", [lem.lemmatize(word) for word in word_list4])
print("['universe', 'university'] -------> ", [lem.lemmatize(word) for word in word_list5])
Output:
['play', 'playing', 'played'] --------------------> ['play', 'playing', 'played']
['feet', 'foot', 'foots', 'footing'] -------------> ['foot', 'foot', 'foot', 'footing']
['organize', 'organizing', 'organization'] -------> ['organize', 'organizing', 'organization']
['benefactor', 'benevolent', 'beneficial'] -------> ['benefactor', 'benevolent', 'beneficial']
['universe', 'university'] -------> ['universe', 'university']
We can see here that lemmatization is more conservative about trimming a word than stemming is. University does not change to universe, and organize/organization does not change to organ.
Handling plurals in lemmatization
lem.lemmatize("ponies")
lem.lemmatize("caresses")
lem.lemmatize("cats")
ponies --> pony
caresses --> caress
cats --> cat
Understanding Lemma
To better understand how lemma applies within linguistics, let’s take a trip down memory lane and recall verb conjugation.
Conjugate ‘To Be’
Present Tense:
I am
You are
He/She/It is
We are
You are
They are
Past Tense:
I was
You were
He/She/It was
We were
You were
They were
Thus, the lemma of AM, IS, ARE, WAS, and WERE is BE.
Code test:
print('I am --> To', lem.lemmatize("am", pos="v"))        # v is for verb
print('You are --> To', lem.lemmatize("are", pos="v"))     # v is for verb
print('He is --> To', lem.lemmatize("is", pos="v"))        # v is for verb
print('They were --> To', lem.lemmatize("were", pos="v"))  # v is for verb
I am --> To be
You are --> To be
He is --> To be
They were --> To be
Note: with the NLTK lemmatizer, the part of speech (pos) needs to be specified. In the example above, I set pos to “v” for verb. If the pos parameter is not specified, the default is noun.
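A quick check of that default behaviour (assuming lem is the WordNetLemmatizer instance created above):
# Without a pos argument the word is treated as a noun, so the verb 'are'
# passes through unchanged; with pos='v' it is reduced to its lemma 'be'.
print(lem.lemmatize("are"))           # are
print(lem.lemmatize("are", pos="v"))  # be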
Complex verbs in NLTK Lemmatizer
The NLTK lemmatizer can also identify base words for complex (irregular) verbs. See the example below:
lem.lemmatize('beheld', pos = 'v')
lem.lemmatize('withheld', pos = 'v')
lem.lemmatize('flung', pos = 'v')
beheld --> behold
withheld --> withhold
flung --> fling
Adverbs in NLTK Lemmatizer
# adverb
print('farther', lem.lemmatize('farther', pos = 'r'))
# superlative adverb
print('farthest', lem.lemmatize('farthest', pos = 'r'))
# adverb
print('loudly', lem.lemmatize('loudly', pos = 'r'))
# superlative adverb
print('loudest', lem.lemmatize('loudest', pos = 'r'))
Output
farther --> far
farthest --> farthest
loudly --> loudly
loudest --> loudest
The NLTK lemmatizer can handle most adverbs, such as farther and loudly. However, adverbs that end in ‘-est’, also known as superlative adverbs, are not supported by the NLTK lemmatizer. Therefore, loudest does not change to loud and farthest does not change to far.
Adjectives with different endings
# comparative adjective
print('closer -->', lem.lemmatize('closer', pos = 'a'))
# superlative adjective
print('closest -->', lem.lemmatize('closest', pos = 'a'))
# comparative adjective
print('smaller -->', lem.lemmatize('smaller', pos = 'a'))
# superlative adjective
print('smallest -->', lem.lemmatize('smallest', pos = 'a'))
# comparative and superlative of dry
print('drier -->', lem.lemmatize('drier', pos = 'a'))
print('driest -->', lem.lemmatize('driest', pos = 'a'))
Output
closer --> close
closest --> close
smaller --> small
smallest --> small
drier --> dry
driest --> dry
NLTK Lemmatizer handles adjectives with different endings very well.
Pros for lemmatization
- Using the base word ensures that the meaning behind the word is not lost
Cons for lemmatization
- Need to identify the part of speech
- Requires understanding the fundamentals of linguistics, and is therefore more complex
B. Spacy Lemmatization
Spacy provides a lemmatization package. Let’s see how this package compares with NLTK lemmatization.
Spacy Lemmatization
In Spacy lemmatization, the part of speech does not need to be passed as a parameter for each word.
# import spacy package
import spacy
# load the English spacy model
sp = spacy.load('en_core_web_sm')
# create word list
words = ['better', 'ran', 'are', 'running', 'were', 'shared', 'organize', 'university',
         'awoken', 'arose', 'beheld', 'sped', 'withhold', 'flung', 'cats', 'timely',
         'actively', 'tighter', 'smaller', 'farther', 'driest', 'farthest', 'loudly']
# lemmatize each word
for i in words:
    token = sp(i)
    for word in token:
        print(word.text, '-->', word.lemma_)
Output:
better --> well
ran --> run
are --> be
running --> run
were --> be
shared --> share
organize --> organize
university --> university
awoken --> awake
arose --> arise
beheld --> beheld
sped --> speed
withhold --> withhold
flung --> flung
cats --> cat
timely --> timely
actively --> actively
tighter --> tight
smaller --> small
farther --> far
driest --> dry
farthest --> farthest
loudly --> loudly
NLTK Lemmatizer
First, I use nltk.pos_tag() to identify the part of speech for each word so that I can loop through the words when applying the NLTK lemmatizer. This is also the method I would use on large text datasets to identify parts of speech. (If the tagger data is not already installed, it may need to be fetched first with nltk.download('averaged_perceptron_tagger').)
# create word list
words = ['better', 'ran', 'are', 'running', 'were', 'shared', 'organize', 'university',
         'awoken', 'arose', 'beheld', 'sped', 'withhold', 'flung', 'cats', 'timely',
         'actively', 'tighter', 'smaller', 'farther', 'driest', 'farthest', 'loudly']
# identify the part of speech for each word in the list
tags = nltk.pos_tag(words)
# keep only the part-of-speech tags
tag = list(dict(tags).values())
# map the Penn Treebank tags to the pos values the NLTK lemmatizer expects
for i in range(0, len(tag)):
    if tag[i] == 'JJR' or tag[i] == 'JJ' or tag[i] == 'JJS':
        tag[i] = 'a'
    elif tag[i] == 'VBP' or tag[i] == 'VBG' or tag[i] == 'VBD' or tag[i] == 'VBN':
        tag[i] = 'v'
    elif tag[i] == 'NN' or tag[i] == 'NNS':
        tag[i] = 'n'
    elif tag[i] == 'RB':
        tag[i] = 'r'
    else:
        pass
# pair each word with its mapped part of speech
word_list = []
for i in range(0, len(tag)):
    word_list.append([words[i], tag[i]])
# loop to apply the NLTK lemmatizer to each (word, pos) pair
nltk_words = []
for word in word_list:
    print(word[0], '-->', lem.lemmatize(word[0], pos = word[1]))
    nltk_words.append(lem.lemmatize(word[0], pos = word[1]))
Output
better --> good
ran --> ran
are --> be
running --> run
were --> be
shared --> share
organize --> organize
university --> university
awoken --> awake
arose --> arose
beheld --> beheld
sped --> speed
withhold --> withhold
flung --> flung
cats --> cat
timely --> timely
actively --> actively
tighter --> tight
smaller --> small
farther --> farther
driest --> driest
farthest --> farthest
loudly --> loudly
Comparison
Spacy lemmatization can identify the root word for verbs, plurals, adjectives, and most adverbs, but it cannot handle complex verbs such as ‘flung’, ‘withhold’, and ‘beheld’. When the pos parameter is set to ‘v’, the NLTK lemmatizer changes flung to fling; however, when I used nltk.pos_tag() to identify the part of speech, flung did not change. Other words, such as driest, farther, arose, withhold, and beheld, also did not change because nltk.pos_tag() did not identify their part of speech correctly. Even so, nltk.pos_tag() is necessary when using the NLTK lemmatizer on thousands of documents, since specifying the part of speech by hand is not feasible at that scale.
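To reproduce this comparison directly, here is a small side-by-side loop (a sketch, assuming sp and lem from the sections above are still loaded, with the pos values supplied by hand for the NLTK side):
# Print the Spacy and NLTK lemmas next to each other for a few of the
# trickier words from the comparison above.
for word, pos in [('flung', 'v'), ('beheld', 'v'), ('better', 'a'), ('farthest', 'r')]:
    spacy_lemma = sp(word)[0].lemma_
    nltk_lemma = lem.lemmatize(word, pos=pos)
    print(f"{word}: spacy -> {spacy_lemma}, nltk -> {nltk_lemma}")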
References
Thomas, Rachel. “Topic Modeling with SVD & NMF (NLP Video 2).” YouTube, 8 July 2019, www.youtube.com/watch?v=tG3pUwmGjsc.
https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
Porter, Martin F. 1980. An algorithm for suffix stripping. Program 14 (3): 130-137.
http://www.nltk.org/_modules/nltk/stem/wordnet.html#WordNetLemmatizer
http://www.nltk.org/api/nltk.stem.html#nltk.stem.wordnet.WordNetLemmatizer.lemmatize