Python PorterStemmer() treats the last word in a sentence differently


For an offline environment, I have the following code:

import pandas as pd
import re
from nltk.stem import PorterStemmer

test = {'grams':  ['First value because one does two THREE', 'Second value because three and three four', 'Third donkey three']}
test = pd.DataFrame(test, columns = ['grams'])
STOPWORDS = {'and', 'does', 'because'}

def rower(x):
    cleanQ = []  
    for i in range(len(x)): cleanQ.append(re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', x[i]).lower())
    
    splitQ = []
    for row in cleanQ: splitQ.append(row.split())
    splitQ[:] = [[word for word in sub if word not in STOPWORDS] for sub in splitQ]
    splitQ = list(map(' '.join, splitQ))
    print(splitQ)
    
    originQ = []    
    for i in splitQ: 
        originQ.append(PorterStemmer().stem(i))
    print(originQ)
    
rower(test.grams)
This produces:

['first value one two three', 'second value three three four', 'third donkey three']
['first value one two thre', 'second value three three four', 'third donkey thre']
The first list shows the sentences before PorterStemmer() is applied. The second list shows the sentences after it is applied.

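As you can see, PorterStemmer() trims the word 'three' to 'thre' only when it is the last word in the sentence. When 'three' is not the last word, it stays 'three'. I can't figure out why it does this. I am also worried that if I apply the rower(x) function to other sentences, it might produce similar results without my noticing.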


How can I prevent PorterStemmer from treating the last word differently?

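The main mistake here is that you are passing multiple words to the stemmer at once instead of one word at a time. The whole string (for example 'third donkey three') is treated as a single word, so only its ending is stemmed. Stem each word separately instead: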

import pandas as pd
import re
from nltk.stem import PorterStemmer

test = {'grams': ['First value because one does two THREE', 'Second value because three and three four',
                  'Third donkey three']}
test = pd.DataFrame(test, columns=['grams'])
STOPWORDS = {'and', 'does', 'because'}

ps = PorterStemmer()

def rower(x):
    cleanQ = []
    for i in range(len(x)): cleanQ.append(re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', x[i]).lower())

    splitQ = []
    for row in cleanQ: splitQ.append(row.split())
    splitQ = [[word for word in sub if word not in STOPWORDS] for sub in splitQ]
    print('IN:', splitQ)
    originQ = [[ps.stem(word) for word in sent] for sent in splitQ]
    print('OUT:', originQ)


rower(test.grams)
Output:

IN: [['first', 'value', 'one', 'two', 'three'], ['second', 'value', 'three', 'three', 'four'], ['third', 'donkey', 'three']]
OUT: [['first', 'valu', 'one', 'two', 'three'], ['second', 'valu', 'three', 'three', 'four'], ['third', 'donkey', 'three']]
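There are good explanations of why the stemmer drops the final "e" from some words (such as 'value' becoming 'valu'). If the output is not what you expect, consider using a lemmatizer instead.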

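Change the line to the following, so that each word is stemmed separately and each sentence is then rejoined into a string:
originQ = [' '.join([ps.stem(word) for word in sent]) for sent in splitQ]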
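
To see the difference in isolation, here is a minimal sketch that stems one sentence both ways. The helper name stem_sentence is hypothetical (it does not appear in the posts above), and it assumes the text has already been cleaned and lowercased, as rower() does.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()


def stem_sentence(sentence):
    # Hypothetical helper: stem each whitespace-separated token on its own,
    # then rejoin the stems into a single string.
    return ' '.join(ps.stem(word) for word in sentence.split())


sentence = 'third donkey three'

# Passing the whole string: the stemmer sees one long "word"
# and only rewrites its ending.
whole = ps.stem(sentence)

# Passing one token at a time: every word is stemmed independently.
per_word = stem_sentence(sentence)

print(whole)
print(per_word)
```

With the sentence from the question, the first print should give 'third donkey thre' (matching the question's output) and the second 'third donkey three' (matching the first answer's output), so the position of a word in the sentence no longer matters.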