Python NLTK - Stop Words, Hashing on a list


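I'll try to make this as easy to understand as possible, because I can imagine how annoying long, drawn-out questions can get.

I have a list of tweets, all stored in a variable called "all_tweets". (This is because some tweets fall into the "text" category and others into "extended tweet", so I had to merge them together.)

I tokenized the tweets and everything worked perfectly. I got a list for each tweet, with every word in the tweet separated out.

I'm now trying to implement stopwords in my code so that I can filter out any stopwords.

My code is as follows: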

wordVec = [nltk.word_tokenize(tweet) for tweet in all_tweets]
stopWords = set(stopwords.words('english'))
wordsFiltered = []

for w in wordVec:
    if w not in stopWords:
        wordsFiltered.append(w)

I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-29-ae7a97fb3811> in <module>
      4 
      5 for w in wordVec:
----> 6     if w not in stopWords:
      7         wordsFiltered.append(w)

TypeError: unhashable type: 'list'


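I'm well aware that I can't hash a list. I looked at my tweets and each set of words is in its own list. I understand what is happening, but is there a fix for this issue?

Any help would be appreciated.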

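You say you're well aware of what's happening, but are you? wordVec is not a list of strings; it is a list of lists of strings.

So when you say:

for w in wordVec:

w is not a word but a list of words. That means that when you say:

if w not in stopWords:

you are asking whether the current list of words is in the set. Lists cannot be stored in a set, or looked up in one, because they are mutable and therefore unhashable, hence the error.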
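To see the difference in isolation, here is a minimal sketch. It assumes a small hand-written stop word set and a hand-written token list (both hypothetical, so it runs without NLTK or its corpus downloads). It only illustrates that a membership test hashes whatever is on the left-hand side of "in".

```python
# A small hand-written stand-in for NLTK's stop word set (an assumption for this sketch).
stop_words = {"i", "do", "you", "have"}

# One tokenized tweet: a list of strings, like a single element of wordVec.
tokens = ["do", "you", "have", "cheese", "?"]

# Testing the whole list for membership forces Python to hash the list, which fails.
try:
    tokens in stop_words
    list_error = None
except TypeError as exc:
    list_error = str(exc)

print(list_error)  # the message mentions: unhashable type: 'list'

# Testing each string individually works, because strings are hashable.
kept = [word for word in tokens if word not in stop_words]
print(kept)
```

The first test fails before any comparison takes place, which is why the traceback points at the "if" line rather than at the append.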

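I'm guessing that what you really want to do is iterate over the lists of words, and then iterate over the words in the current list: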

import nltk
from nltk.corpus import stopwords


tweets = [
    "Who here likes cheese? I myself like cheese.",
    "Do you have cheese? Do they have cheese?"
]

tokenized_tweets = [nltk.word_tokenize(tweet) for tweet in tweets]
stop_words = set(stopwords.words("english"))

filtered_tweets = []

for tokenized_tweet in tokenized_tweets:
    filtered_tweets.append(" ".join(word for word in tokenized_tweet if word.casefold() not in stop_words))

print(filtered_tweets)

Output:

['likes cheese ? like cheese .', 'cheese ? cheese ?']

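I just arbitrarily decided to join each list of filtered words into a string before appending it to the filtered_tweets list. As you can see, this leaves the punctuation separated by spaces, which may be undesirable. In any case, you don't need to join the words back into a string; you can simply append the list itself.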
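As a rough sketch of that last suggestion, the loop below appends each filtered token list as-is. The pre-tokenized tweets and the tiny stop word set are hand-written stand-ins (assumptions) for the output of nltk.word_tokenize and stopwords.words("english"), so the snippet runs without any NLTK downloads.

```python
# Pre-tokenized tweets standing in for the output of nltk.word_tokenize (assumption).
tokenized_tweets = [
    ["Who", "here", "likes", "cheese", "?"],
    ["Do", "you", "have", "cheese", "?"],
]

# Hand-written stand-in for set(stopwords.words("english")) (assumption).
stop_words = {"who", "here", "do", "you", "have"}

filtered_tweets = []
for tokenized_tweet in tokenized_tweets:
    # Append the filtered token list itself instead of joining it into a string.
    filtered_tweets.append(
        [word for word in tokenized_tweet if word.casefold() not in stop_words]
    )

print(filtered_tweets)  # [['likes', 'cheese', '?'], ['cheese', '?']]
```

Keeping one list of tokens per tweet preserves the shape of the original tokenized data, which is usually what later processing steps expect.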

The variable wordVec is a list of lists, so when you do:

for w in wordVec:
    if w not in stopWords:

you are checking whether a list is in a set. Since w is a list, you get:

TypeError: unhashable type: 'list'

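You can fix it like this:

wordsFiltered = []
for w in wordVec:
    # w is one tokenized tweet; keep only the tokens that are not stop words
    wordsFiltered.append([e for e in w if e not in stopWords])

Or you can use a nested list comprehension:

wordsFiltered = [[e for e in w if e not in stopWords] for w in wordVec]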
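One caveat: these filters compare tokens exactly as written, while NLTK's English stop word list is lowercase, so capitalised tokens such as "I" or "Do" slip through. The sketch below contrasts the two behaviours. The helper name remove_stop_words, the sample tokens, and the small stop word set are all hypothetical, chosen so the snippet runs without NLTK.

```python
def remove_stop_words(tokenized_tweets, stop_words):
    """Return new token lists with stop words removed, ignoring case."""
    return [
        [token for token in tokens if token.casefold() not in stop_words]
        for tokens in tokenized_tweets
    ]


# Hypothetical lowercase stop words and pre-tokenized tweets for illustration.
stop_words = {"i", "do", "you"}
word_vec = [["I", "like", "cheese"], ["Do", "you", "like", "cheese"]]

# Exact-match filtering keeps "I" and "Do" because only lowercase forms are in the set.
case_sensitive = [[e for e in w if e not in stop_words] for w in word_vec]

# Case-insensitive filtering removes them as well.
case_insensitive = remove_stop_words(word_vec, stop_words)

print(case_sensitive)    # [['I', 'like', 'cheese'], ['Do', 'like', 'cheese']]
print(case_insensitive)  # [['like', 'cheese'], ['like', 'cheese']]
```

Whether case should be ignored depends on the data; for tweets it probably should be, but that is a judgment call rather than a rule.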

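Try this:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = 'hello my friend, how are you today, are you ok?'
tokenized_word = word_tokenize(text)  # a flat list of tokens for one string
stop_words = set(stopwords.words('english'))
stops = []  # despite the name, this collects the tokens that are NOT stop words
for w in tokenized_word:
    if w not in stop_words:
        stops.append(w)
print(stops)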

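I tried this and it seemed like a legitimate attempt, but when I run it through my word2vec model, it no longer recognises words in the vocabulary. I can show you the result if you'd like me to go further, but I appreciate the attempt, thank you.

@Conor McNally: Sure, I'd love to see it. Maybe update your original post with more info/code?

To make this a reasonable answer, it could elaborate on the fact that, instead of processing the list all at once, you process the individual words in it.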