Python: Is there any way to prevent my WordNetLemmatizer from splitting words like "can't" or "don't"?

The code below is what I currently have. It works well overall, but it splits words like "don't" into "don" and "t". I'd like it to either drop the apostrophe so the word comes through as "dont", or keep it intact as "don't", although that may cause problems later with TfidfVectorizer.

Is there a way to do this without much effort?

from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()

def lemmatize_review(review):
    """Lemmatize single review string"""
    lemmatized_review = ' '.join([lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in word_tokenize(review)])
    return lemmatized_review

review_data['Lemmatized_Review'] = review_data['Review'].apply(lemmatize_review)

You can simply replace the `'` character with an empty string and then lemmatize as usual, like this:

>>> word = "didn't can't won't"
>>> word
"didn't can't won't"
>>> x = word.replace("'", "")
>>> x
'didnt cant wont'
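If you go the apostrophe-stripping route, the replacement can be applied to the whole review before tokenizing. A minimal sketch (the `strip_apostrophes` helper name and the handling of the curly apostrophe `\u2019` are my additions, not part of the original answer):

```python
def strip_apostrophes(text):
    """Remove straight and curly apostrophes so contractions
    survive tokenization as single tokens like 'dont'."""
    return text.replace("'", "").replace("\u2019", "")

print(strip_apostrophes("didn't can\u2019t won't"))  # didnt cant wont
```

With this in place, `lemmatize_review` can call `strip_apostrophes(review)` before `word_tokenize`, so the tokenizer never sees an apostrophe at all.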

Alternatively, you can use TweetTokenizer instead of word_tokenize:

from nltk.tokenize import TweetTokenizer

text = "didn't can't won't how are you"
tokenizer = TweetTokenizer()

tokenizer.tokenize(text)
# Output:
["didn't", "can't", "won't", 'how', 'are', 'you']