Code前端首页关于Code前端联系我们

Python 数据科学教程:词干提取和词形还原(NLTK 库、Porter 词干提取算法)

terry 2年前 (2023-09-25) 阅读数 68 #后端开发

在自然语言处理领域,我们会遇到两个或多个单词具有共同词根的情况。例如,agreeagreeagree 具有相同的根。涉及任何这些单词的搜索都应将它们视为与根单词相同的单词。因此,将所有单词与其词根联系起来非常重要。 NLTK 库具有执行此链接并提供显示根单词的输出的方法。

以下程序使用Porter Stemming算法进行词干提取。

import nltk
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
# First Word tokenization
nltk_tokens = nltk.word_tokenize(word_data)
#Next find the roots of the word
for w in nltk_tokens:
       print ("Actual: %s  Stem: %s"  % (w,porter_stemmer.stem(w)))
Python

运行上面的代码示例,得到以下结果 -

Actual: It  Stem: It
Actual: originated  Stem: origin
Actual: from  Stem: from
Actual: the  Stem: the
Actual: idea  Stem: idea
Actual: that  Stem: that
Actual: there  Stem: there
Actual: are  Stem: are
Actual: readers  Stem: reader
Actual: who  Stem: who
Actual: prefer  Stem: prefer
Actual: learning  Stem: learn
Actual: new  Stem: new
Actual: skills  Stem: skill
Actual: from  Stem: from
Actual: the  Stem: the
Actual: comforts  Stem: comfort
Actual: of  Stem: of
Actual: their  Stem: their
Actual: drawing  Stem: draw
Actual: rooms  Stem: room
Shell

词形还原与词干提取类似,但它为单词带来了上下文。于是它进一步将意义相近的词组合成一个词。例如,如果一个段落包含汽车、火车和汽车等单词,它会将它们全部连接到汽车。在下面的程序中,使用WordNet词汇数据库进行词汇化。

import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)
for w in nltk_tokens:
    print ("Actual: %s  Lemma: %s"  % (w,wordnet_lemmatizer.lemmatize(w)))
Python

当我们运行上面的代码时,它会产生以下结果。

Actual: It  Lemma: It
Actual: originated  Lemma: originated
Actual: from  Lemma: from
Actual: the  Lemma: the
Actual: idea  Lemma: idea
Actual: that  Lemma: that
Actual: there  Lemma: there
Actual: are  Lemma: are
Actual: readers  Lemma: reader
Actual: who  Lemma: who
Actual: prefer  Lemma: prefer
Actual: learning  Lemma: learning
Actual: new  Lemma: new
Actual: skills  Lemma: skill
Actual: from  Lemma: from
Actual: the  Lemma: the
Actual: comforts  Lemma: comfort
Actual: of  Lemma: of
Actual: their  Lemma: their
Actual: drawing  Lemma: drawing
Actual: rooms  Lemma: room

版权声明

本文仅代表作者观点,不代表Code前端网立场。
本文系作者Code前端网发表,如需转载,请注明页面地址。

发表评论:

◎欢迎参与讨论,请在这里发表您的看法、交流您的观点。

热门