Python: Is there any way to prevent my WordNetLemmatizer from splitting words such as "can't" or "didn't"?


Tags: python, python-3.x, nltk, lemmatization


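The code below is what I currently have. It works well, but it changes words like "didn't" into "didn" and "t". I would like it either to remove the apostrophe, so the word becomes "didnt", or to keep it as "didn't", although the latter might cause problems later with TfidfVectorizer.

Is there a way to achieve this without much effort?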

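from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    # Tag the single word and keep the first letter of its Penn Treebank tag
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    # Fall back to noun for any tag outside the four WordNet categories
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()

def lemmatize_review(review):
    """Lemmatize single review string"""
    lemmatized_review = ' '.join([lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in word_tokenize(review)])
    return lemmatized_review

# review_data is a pandas DataFrame with a 'Review' column of strings
review_data['Lemmatized_Review'] = review_data['Review'].apply(lemmatize_review)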

You can simply replace the "'" character with an empty string "" and then go on to lemmatize, like this:

>>> word = "didn't can't won't"
>>> word
"didn't can't won't"
>>> x = word.replace("'", "")
>>> x
'didnt cant wont'
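
A plain `replace` also removes apostrophes used as quotation marks around a phrase. If that matters for your reviews, here is a minimal sketch that removes only apostrophes sitting between two letters. The helper name `strip_inner_apostrophes` is hypothetical, and treating the curly apostrophe (U+2019) like the straight one is an assumption about the input data:

```python
import re

# Matches a straight or curly apostrophe that has an ASCII letter on both
# sides, so quotation marks around a phrase are left untouched.
_INNER_APOSTROPHE = re.compile(r"(?<=[A-Za-z])['\u2019](?=[A-Za-z])")

def strip_inner_apostrophes(text):
    """Remove apostrophes inside words (hypothetical helper)."""
    return _INNER_APOSTROPHE.sub("", text)

print(strip_inner_apostrophes("didn't can't won't"))
print(strip_inner_apostrophes("I didn't say 'hello'"))
```

It could then be applied to each review before tokenizing, for example by calling it at the start of `lemmatize_review`. Note that "cant" and "wont" are themselves English words, so this may merge distinct terms.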

You can use TweetTokenizer instead of word_tokenize:

from nltk.tokenize import TweetTokenizer

# Avoid naming the variable `str`, which would shadow the built-in type
text = "didn't can't won't how are you"
tokenizer = TweetTokenizer()

tokenizer.tokenize(text)
# output
["didn't", "can't", "won't", 'how', 'are', 'you']
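
On the TfidfVectorizer concern from the question: keeping "didn't" intact only helps if the vectorizer does not split it again. scikit-learn's default `token_pattern` is `r"(?u)\b\w\w+\b"`, which does not treat the apostrophe as part of a token. The sketch below uses only the `re` module to compare that default with an alternative pattern that allows apostrophes inside a token. The alternative pattern is an assumption of mine, not something from the answers above:

```python
import re

# Default token_pattern used by scikit-learn's text vectorizers
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Hypothetical alternative: allow apostrophes after the first character
APOSTROPHE_PATTERN = r"(?u)\b\w[\w']+\b"

text = "didn't can't won't how are you"

# The default pattern cuts each contraction at the apostrophe
print(re.findall(DEFAULT_PATTERN, text))
# The alternative keeps each contraction as a single token
print(re.findall(APOSTROPHE_PATTERN, text))
```

If this behaves as you want on your data, the second pattern could be passed as `TfidfVectorizer(token_pattern=...)`.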