Replacing stop words with Python NLTK


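I am using NLTK to replace all stop words with the string "QQQQQ". The problem is that if the input text (from which I am removing the stop words) contains more than one sentence, it does not work correctly.

I have the following code: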

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

ex_text = 'This is an example list that has no special keywords to sum up the list, but it will do. Another list is a very special one this I like very much.'

tokenized = word_tokenize(ex_text)

stop_words = set(stopwords.words('english'))
stop_words.add(".")  # Since I do not need punctuation, I added . and ,
stop_words.add(",")

# I need to note the position of all the stopwords for later use
stopword_pos = []
for w in tokenized:
    if w in stop_words:
        stopword_pos.append(tokenized.index(w))

# Replacing stopwords with "QQQQQ"
for i in range(len(stopword_pos)):
    tokenized[stopword_pos[i]] = 'QQQQQ'

print(tokenized)

The code produces the following output:

['This', 'QQQQQ', 'QQQQQ', 'example', 'list', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'special', 'keywords', 'QQQQQ', 'sum', 'QQQQQ', 'QQQQQ', 'list', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'Another', 'list', 'is', 'QQQQQ', 'QQQQQ', 'special', 'one', 'QQQQQ', 'I', 'like', 'very', 'much', '.']

[0, 1, 2, 5, 6, 7, 10, 12, 13, 15, 16, 17, 18, 19, 20, 1, 24, 25, 0, 29, 25, 20]

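As you may notice, it does not replace stop words such as "is" and "." (I added the full stop to the set because I do not need punctuation).

Keep in mind, though, that the "is" and "." in the first sentence are replaced, while the "is" and "." in the second sentence are not.

Another strange thing is that when I print stopword_pos, I get the following output: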

[0, 1, 2, 5, 6, 7, 10, 12, 13, 15, 16, 17, 18, 19, 20, 1, 24, 25, 0, 29, 25, 20]

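As you may notice, the numbers appear to be in ascending order, but then a "1" suddenly appears after the "20" in the list that is supposed to hold the positions of the stop words. There is also a "0" after "29" and a "20" after "25". Maybe that hints at what the problem is.

So the problem is that after the first sentence, the stop words are not replaced with "QQQQQ". Why is that?

Anything that points me in the right direction is much appreciated. I have no idea how to solve this.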

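tokenized.index(w)

This gives you only the first occurrence of the item in the list, so every later occurrence of the same stop word maps back to the position of the first one.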
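To make the behaviour concrete, here is a minimal sketch on a toy token list (the list is made up for illustration and does not come from word_tokenize). It contrasts list.index with enumerate when a token occurs more than once.

```python
# Toy token list in which "is" and "." each occur twice.
tokens = ["This", "is", "fine", ".", "That", "is", "too", "."]
targets = {"is", "."}

# list.index(value) scans from the left and stops at the first match,
# so both occurrences of "is" report position 1.
positions_via_index = [tokens.index(t) for t in tokens if t in targets]

# enumerate pairs each token with its own position instead.
positions_via_enumerate = [i for i, t in enumerate(tokens) if t in targets]

print(positions_via_index)      # [1, 3, 1, 3]
print(positions_via_enumerate)  # [1, 3, 5, 7]
```

The repeated 1 and 3 in the first result are the same kind of out-of-order values that show up in the stopword_pos output in the question.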

So instead of working with indices, you can try an alternative way of replacing the stop words:

tokenized_new = [ word if word not in stop_words else 'QQQQQ' for word in tokenized ]
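If the positions are still needed for later use, one possible variation is to collect them in the same pass. The sketch below is only an illustration: replace_stop_words is a hypothetical helper, demo_stop_words is a tiny hand-written stand-in for stopwords.words('english'), and str.split stands in for word_tokenize so that the example runs without NLTK data. The lowercase comparison is an extra assumption that is not part of the one-liner above.

```python
def replace_stop_words(tokens, stop_words, placeholder="QQQQQ"):
    """Return (new_tokens, positions) without modifying the input list."""
    new_tokens = []
    positions = []
    for i, token in enumerate(tokens):
        if token.lower() in stop_words:  # case-insensitive check
            new_tokens.append(placeholder)
            positions.append(i)          # i is this token's own position
        else:
            new_tokens.append(token)
    return new_tokens, positions


# Stand-ins for NLTK's stop word list and word_tokenize (assumptions).
demo_stop_words = {"this", "is", "a", ".", ","}
demo_tokens = "This is a list . Another list is here .".split()

new_tokens, positions = replace_stop_words(demo_tokens, demo_stop_words)
print(new_tokens)
print(positions)  # [0, 1, 2, 4, 7, 9]
```

Because each position comes from enumerate rather than from a search, the list is already in ascending order and contains no duplicates.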

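The problem is that .index does not return every index at which an item occurs, only the first one, so you need an approach that collects all of them, similar to what is described in other documentation.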

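In the complete code further down, I created stopword_pos_set so that the same index is not added twice. Adding it twice would only assign the same value twice, but if you build stopword_pos without the set and print it, you will see the duplicated values.

One suggestion: in that code I changed the check to w.lower() in stop_words, so that the stop word check is case-insensitive. Otherwise 'This' is not the same as 'this'.

Another suggestion is to use the .update method to add several items to the stop word set at once, i.e. stop_words.update([".", ","]), instead of calling .add several times.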
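As a small illustration of these two suggestions, the sketch below uses a tiny hand-written set in place of the NLTK stop word list (an assumption made so that it runs without any downloads).

```python
stop_words = {"this", "is"}

stop_words.add(".")            # add inserts exactly one element
stop_words.update([",", "!"])  # update inserts every element of an iterable

# Pitfall: update iterates over a string, adding its characters one by one.
chars = set()
chars.update("ab")

print(sorted(stop_words))  # ['!', ',', '.', 'is', 'this']
print(sorted(chars))       # ['a', 'b']

# The stop word list is lowercase, so compare lowercased tokens.
print("This" in stop_words)          # False
print("This".lower() in stop_words)  # True
```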

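You can try the following:

# Requires the NLTK tokenizer and stopwords data (see nltk.download)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

ex_text='This is an example list that has no special keywords to sum up the list, but it will do. Another list is a very special one this I like very much.'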

tokenized = word_tokenize(ex_text)
stop_words = set(stopwords.words('english'))
stop_words.update([".", ","])  #Since I do not need punctuation, I added . and ,

stopword_pos_set = set()
# I need to note the position of all the stopwords for later use
for w in tokenized:
    if w.lower() in stop_words: 
        indices = [i for i, x in enumerate(tokenized) if x == w]
        stopword_pos_set.update(indices)

stopword_pos = sorted(list(stopword_pos_set)) # set to list

# Replacing stopwords with "QQQQQ"
for i in range(len(stopword_pos)):
    tokenized[stopword_pos[i]] = 'QQQQQ'  

print(tokenized)
print(stopword_pos)

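I think it would be better to use something other than index() to find the indices.

So how do you find the indices when an item occurs more than once in the list?

@FrontEnd Python you can use enumerate, as shown in student's answer.

I want the stopword_pos list to be in ascending order, but it is not. Here is the output: [0, 1, 23, 2, 5, 6, 7, 10, 12, 13, 15, 16, 17, 18, 19, 20, 33, 1, 23, 24, 25, 31, 28, 29, 25, 31, 20, 33]

When converting to a list, as in the updated code above, you can use sorted, i.e. stopword_pos = sorted(list(stopword_pos_set)). Also, I created a set and then converted it to a list in order to avoid duplicate indices.

What is the difference between add and update? Why use update?

If you want to add several items, update lets you add them all at once, whereas add adds a single item. You can also look this up.

Why use sorted? How will it sort?

Yes, thank you very much for your help.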
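On the question of how sorted behaves here, a minimal sketch with made-up index values may help. A set stores each value once and does not promise any particular iteration order, while sorted returns a new list, in ascending order for integers.

```python
positions = set()
positions.update([23, 1, 0])
positions.update([1, 33, 2])   # the repeated 1 is stored only once

ordered = sorted(positions)    # a new list; ascending for integers
print(ordered)                 # [0, 1, 2, 23, 33]

# Descending order is available through the reverse flag.
print(sorted(positions, reverse=True))  # [33, 23, 2, 1, 0]
```

Since sorted accepts any iterable, sorted(stopword_pos_set) should give the same result as sorted(list(stopword_pos_set)).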