Is there any way to prevent my WordNetLemmatizer from splitting words such as "can't" or "didn't"?
The code below is what I currently have. It works fine, but it changes words such as "didn't" into "did" and "n't". I would like it either to remove the apostrophe, so that the word appears as "didnt", or to leave it as "didn't", although the latter might cause problems later with TfidfVectorizer. Is there any way to achieve this without too much hassle?
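from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    # pos_tag returns [(word, tag)]; keep only the first letter of the tag
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    # Fall back to NOUN for any tag that is not in the mapping
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()

def lemmatize_review(review):
    """Lemmatize single review string"""
    lemmatized_review = ' '.join([lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in word_tokenize(review)])
    return lemmatized_review

# review_data is the asker's DataFrame with a 'Review' text column
review_data['Lemmatized_Review'] = review_data['Review'].apply(lemmatize_review)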
You can simply replace the "'" character with an empty string and then proceed with the lemmatization, as shown below:
>>> word = "didn't can't won't"
>>> word
"didn't can't won't"
>>> x = word.replace("'", "")
>>> x
'didnt cant wont'
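If the reviews may also contain curly apostrophes (’), a plain replace("'", "") will miss them. The following is a minimal sketch of one way to handle both; the helper name strip_apostrophes is hypothetical and not part of NLTK or of the original answer.

```python
# Hypothetical helper (not from the original answer): delete straight (')
# and curly (\u2019) apostrophes so contractions stay in one piece.
_APOSTROPHES = str.maketrans("", "", "'\u2019")


def strip_apostrophes(text: str) -> str:
    """Return text with apostrophes removed; other characters are unchanged."""
    return text.translate(_APOSTROPHES)


print(strip_apostrophes("didn't can\u2019t won't"))
# didnt cant wont
```

It could be applied to each review before lemmatize_review is called. Note that this also strips possessives ("John's" becomes "Johns"), and that "cant" and "wont" are English words in their own right, so the lemmatizer may treat them as such.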
You can use TweetTokenizer instead of word_tokenize:
from nltk.tokenize import TweetTokenizer
text = "didn't can't won't how are you"  # avoid shadowing the built-in str
tokenizer = TweetTokenizer()
tokenizer.tokenize(text)
# Output:
# ["didn't", "can't", "won't", 'how', 'are', 'you']
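On the TfidfVectorizer concern raised in the question: scikit-learn documents the default token_pattern as r"(?u)\b\w\w+\b", which breaks on apostrophes and drops single-character tokens. The sketch below only approximates that default word tokenization with the standard re module (it ignores stop words, n-grams, and the other vectorizer options); the function name default_style_tokens is hypothetical.

```python
import re

# Default token_pattern documented for scikit-learn's TfidfVectorizer.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def default_style_tokens(text: str) -> list:
    """Approximate the default tokenization: lowercase, then regex findall."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())


# Contractions kept intact lose everything from the apostrophe onward.
print(default_style_tokens("didn't can't won't how are you"))
# ['didn', 'can', 'won', 'how', 'are', 'you']

# With apostrophes removed beforehand, each word survives as one token.
print(default_style_tokens("didnt cant wont how are you"))
# ['didnt', 'cant', 'wont', 'how', 'are', 'you']
```

So if "didn't" is kept as-is, it will probably reach the vectorizer's vocabulary as "didn" unless a custom token_pattern or tokenizer argument is passed to TfidfVectorizer.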