Understanding NLP Basics: Tokenization, Stemming, and Lemmatization
Natural Language Processing (NLP) is a fascinating field that enables computers to understand, interpret, and generate human language. In this post, we’ll dive into some fundamental NLP techniques: tokenization, stemming, and lemmatization.
Introduction to NLP
The process of NLP often begins with breaking down raw text into smaller, manageable units. This allows us to analyze and process the text more effectively. The nltk (Natural Language Toolkit) library in Python is a powerful tool for these tasks.
import os
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Ensure the required NLTK data is available
# Optional: clear any existing nltk_data paths (useful if a previous download is corrupted)
nltk.data.path.clear()

# Download into a clean local directory and register it with NLTK
nltk_data_path = os.path.abspath('nltk_data')
nltk.download('punkt', download_dir=nltk_data_path)  # on newer NLTK releases 'punkt_tab' may be needed instead
nltk.download('wordnet', download_dir=nltk_data_path)
nltk.download('stopwords', download_dir=nltk_data_path)
nltk.data.path.append(nltk_data_path)
Tokenization
Tokenization is the process of breaking text into smaller units called tokens. These tokens can be words, sentences, or even subword units.
Sentence Tokenization
Sentence tokenization, or sentencizing, is the process of splitting a continuous text into a list of sentences.
corpus = """These are the Terms and Conditions governing the use of this Service and the agreement
that operates between You and the Company. These Terms and Conditions set out the rights and obligations of
all users regarding the use of the Service. Your access to and use of the Service is conditioned on Your
acceptance of and compliance with these Terms and Conditions. These Terms and Conditions apply to all visitors,
users and others who access or use the Service. By accessing or using the Service You agree to be bound by these
Terms and Conditions. If You disagree with any part of these Terms and Conditions then You may not access the Service.
You represent that you are over the age of 18. The Company does not permit those under 18 to use the Service.
Your access to and use of the Service is also conditioned on Your acceptance of and compliance with the Privacy Policy
of the Company. Our Privacy Policy describes Our policies and procedures on the collection, use and disclosure of
Your personal information when You use the Application or the Website and tells You about Your privacy rights and
how the law protects You. Please read Our Privacy Policy carefully before using Our Service. Intellectual Property
The Service and its original content (excluding Content provided by You or other users), features and functionality
are and will remain the exclusive property of the Company and its licensors. The Service is protected by copyright,
trademark, and other laws of both the Country and foreign countries. Our trademarks and trade dress may not be
used in connection with any product or service without the prior written consent of the Company."""
documents = nltk.sent_tokenize(corpus, language='english')
print(documents[:2]) # Print the first two sentences to demonstrate
Output:
['These are the Terms and Conditions governing the use of this Service and the agreement
that operates between You and the Company.', 'These Terms and Conditions set out the rights and obligations of
all users regarding the use of the Service.']
Word Tokenization
Word tokenization is the process of splitting a sentence into individual words. This is a crucial step for many NLP tasks, such as text analysis and feature extraction.
Before word tokenization, it’s often beneficial to clean the text by removing special characters and converting it to lowercase.
cleaned_corpus = []
for i in range(len(documents)):
    review = re.sub('[^a-zA-Z]', ' ', documents[i])
    review = review.lower()
    cleaned_corpus.append(review)
print(cleaned_corpus[0])
Output:
these are the terms and conditions governing the use of this service and the agreement that operates between you and the company
Now, we can perform word tokenization on the cleaned sentences:
words = nltk.word_tokenize(cleaned_corpus[0])
print(words)
Output:
['these', 'are', 'the', 'terms', 'and', 'conditions', 'governing', 'the', 'use', 'of', 'this',
'service', 'and', 'the', 'agreement', 'that', 'operates', 'between', 'you', 'and', 'the', 'company']
Stemming
Stemming is a technique used to reduce words to their root or base form, known as a “stem.” The stem may not be a valid word itself, but it’s useful for reducing inflected words to a common base. The Porter Stemmer is a widely used algorithm for this purpose.
stemmer = PorterStemmer()
print(stemmer.stem('goes'))
print(stemmer.stem('going'))
print(stemmer.stem('gone'))
Output:
goe
go
gone
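Notice that the stem is frequently not a dictionary word at all. Here is a quick illustrative sketch (the example words are our own; the expected stems in the comments assume NLTK's PorterStemmer):
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Porter reduces related forms to a shared stem, which is often not a real word
for word in ['eating', 'eats', 'history', 'finally', 'services']:
    print(word, '->', stemmer.stem(word))
# Expected: eating -> eat, eats -> eat, history -> histori, finally -> final, services -> servic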
Here’s an example of applying stemming to our cleaned corpus, excluding stopwords:
stop_words = set(stopwords.words('english'))
for sentence in cleaned_corpus:
    words = nltk.word_tokenize(sentence)
    stemmed_words = [stemmer.stem(word) for word in words if word not in stop_words]
    print(' '.join(stemmed_words[:10]))  # Print first 10 stemmed words for brevity
Output (example of first sentence):
term condit govern use servic agreement oper compani
Lemmatization
Lemmatization is a more sophisticated technique than stemming that reduces words to their base form, called a “lemma.” Unlike stemming, lemmatization ensures that the reduced form is a valid word in the language. It achieves this by using a vocabulary and morphological analysis of words.
lemme = WordNetLemmatizer()
print(lemme.lemmatize('goes'))
print(lemme.lemmatize('going'))
print(lemme.lemmatize('gone'))
Output:
go
going
gone
As you can see, lemme.lemmatize('goes') returns go, which is a valid word. However, going stays going and gone stays gone. This is because WordNetLemmatizer treats every word as a noun unless you pass a pos argument; with pos='v', both going and gone are reduced to go. This highlights the difference in how lemmatization handles words compared to stemming.
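The pos argument makes this explicit. A small sketch (the part-of-speech codes are WordNet's: 'v' for verb, 'a' for adjective, and 'n' for noun, which is the default):
from nltk.stem import WordNetLemmatizer

lemme = WordNetLemmatizer()
# With an explicit verb tag, inflected forms collapse to the verb lemma
print(lemme.lemmatize('going', pos='v'))   # go
print(lemme.lemmatize('gone', pos='v'))    # go
# Adjectives can map to a different base word entirely
print(lemme.lemmatize('better', pos='a'))  # good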
Here’s an example of applying lemmatization to our cleaned corpus, again excluding stopwords. This time we also collect the processed sentences into a list, which we’ll reuse in the TF-IDF section below:
processed_corpus = []
for sentence in cleaned_corpus:
    words = nltk.word_tokenize(sentence)
    lemmatized_words = [lemme.lemmatize(word) for word in words if word not in stop_words]
    processed_corpus.append(' '.join(lemmatized_words))
    print(' '.join(lemmatized_words[:10]))  # Print first 10 lemmatized words for brevity
Output (example of first sentence):
term condition governing use service agreement operates company
TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency. It’s a numerical statistic that reflects how important a word is to a document in a collection or corpus. TF-IDF is often used as a weighting factor in information retrieval and text mining. It increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
Term Frequency (TF)
Term Frequency measures how frequently a term appears in a document. Since documents differ in length, a term is likely to appear more times in a longer document than in a shorter one. The raw count is therefore divided by the document length (the total number of terms in the document) as a form of normalization:
\[ TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} \]
Inverse Document Frequency (IDF)
Inverse Document Frequency measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as “is,” “of,” and “that,” may appear many times but have little actual importance. Thus, we need to weigh down the frequent terms while scaling up the rare ones.
\[ IDF(t, D) = \log\left(\frac{\text{Total number of documents } N}{\text{Number of documents with term } t}\right) \]
TF-IDF Calculation
The TF-IDF score is then calculated by multiplying the TF and IDF values:
\[ \text{TF-IDF}(t, d, D) = TF(t, d) \times IDF(t, D) \]
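To make these formulas concrete, here is a minimal sketch that computes them by hand for a tiny made-up two-document corpus. It follows the plain definitions above; sklearn's TfidfVectorizer, used next, applies IDF smoothing and L2 normalization by default, so its numbers will differ slightly.
import math

docs = [
    "the service is good".split(),
    "the service is protected by copyright".split(),
]

def tf(term, doc):
    # Number of times the term appears, divided by the total number of terms
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(total number of documents / number of documents containing the term)
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf('service', docs[0], docs))  # 0.0 -- 'service' appears in every document
print(tf_idf('good', docs[0], docs))     # 0.25 * log(2) ~= 0.173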
Implementing TF-IDF with sklearn
We can use TfidfVectorizer from sklearn.feature_extraction.text to calculate TF-IDF scores for our corpus.
First, let’s reuse the processed_corpus we built during lemmatization. It is already lowercased, stripped of special characters, stopword-free, and lemmatized, so it can be passed straight to the vectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
y = cv.fit_transform(processed_corpus)
print("Shape of TF-IDF matrix:", y.shape)
print("First document's TF-IDF vector (sparse representation):\n", y[0])
print("First document's TF-IDF vector (dense array representation):\n", y[0].toarray())
print("Vocabulary learned by TF-IDF vectorizer:", cv.vocabulary_)
Output (example - actual output may vary based on corpus content and vocabulary):
Shape of TF-IDF matrix: (8, 62)
First document's TF-IDF vector (sparse representation):
(0, 15) 0.2319047910903822
(0, 52) 0.2319047910903822
(0, 48) 0.2319047910903822
(0, 41) 0.2319047910903822
(0, 39) 0.2319047910903822
(0, 31) 0.2319047910903822
(0, 26) 0.2319047910903822
(0, 25) 0.2319047910903822
(0, 20) 0.2319047910903822
(0, 18) 0.2319047910903822
(0, 17) 0.2319047910903822
(0, 12) 0.2319047910903822
(0, 7) 0.2319047910903822
(0, 2) 0.2319047910903822
(0, 0) 0.2319047910903822
First document's TF-IDF vector (dense array representation):
[[0.23190479 0. 0.23190479 0. 0. 0.
0. 0.23190479 0. 0. 0. 0.
0.23190479 0. 0. 0.23190479 0. 0.23190479
0.23190479 0. 0.23190479 0. 0. 0.
0. 0.23190479 0.23190479 0. 0. 0.
0. 0.23190479 0. 0. 0. 0.
0. 0. 0. 0.23190479 0. 0.23190479
0. 0. 0. 0. 0. 0.
0.23190479 0. 0. 0. 0.23190479 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. ]]
Vocabulary learned by TF-IDF vectorizer: {'term': 52, 'condition': 15, 'governing': 25, 'use': 56, 'service': 48,
'agreement': 2, 'operates': 39, 'company': 12, 'set': 49, 'right': 45, 'obligation':
38, 'user': 55, 'regarding': 44, 'access': 0, 'conditioned': 16, 'acceptance': 1, 'compliance': 14, 'apply': 4,
'visitor': 57, 'others': 40, 'accessing': 5, 'using': 58, 'agree': 3,
'bound': 8, 'disagree': 21, 'part': 42, 'may': 37, 'represent': 46, 'age': 6, 'permit': 43, 'also': 7, 'privacy': 4