Python: how to check whether a tuple contains a word and, if so, remove it

I am importing a CSV file via Pandas, in the following format:

test = [
    ('the beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]
I want to check whether any of the words in my stop list appear in the test set defined above and, if so, remove them. However, when I try, I just get back the full list without any modification. Here is my current code:

import pandas as pd
from nltk.corpus import stopwords

df = pd.read_csv('test.csv', delimiter=',')
tlist = [tuple(x) for x in df.values]
tlist = [(x.lower(), y.lower()) for x, y in tlist]

def remove_stopwords(train_list):
    new_list = []
    for word in train_list:
        if word not in stopwords.words('english'):
            new_list.append(word)
    print(new_list)

remove_stopwords(tlist)
I am trying to use the stopwords provided by the NLTK corpus. As I said, when I test this code with print(new_list), all that happens is that I get tlist back exactly as it went in.
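A minimal sketch of why the check never matches (a tiny hand-rolled stop set stands in for stopwords.words('english')): each element of tlist is a whole (sentence, label) tuple, and a tuple never equals a single stopword string, so the membership test is always False and every tuple survives.

```python
# Each element of tlist is a (sentence, label) tuple; comparing a whole
# tuple against stopword strings can never match, so nothing is filtered.
stop = {'the', 'was', 'my'}  # stand-in for stopwords.words('english')
tlist = [('the beer was good.', 'pos'), ('i do not enjoy my job', 'neg')]

new_list = [item for item in tlist if item not in stop]
print(new_list)  # identical to tlist: no tuple equals any stopword
```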

The word in your for loop is actually a tuple, because tlist has the form [(a1, b1), (a2, b2)], a list of tuples. So you are comparing each whole tuple against a single word from the stopwords. If you add a print, you will see this:

def remove_stopwords(train_list):
    new_list = []
    for word in train_list:
        print(word)  # prints a whole tuple, e.g. ('the beer was good.', 'pos')
        if word not in stopwords.words('english'):
            new_list.append(word)
    print(new_list)
If you want to remove the words, you need at least two loops: one to iterate over the list and another to iterate over the words. Something like this would work:

def remove_stopwords(train_list):
    stop = set(stopwords.words('english'))
    new_list = []
    for tl in train_list:
        # tl is a tuple such as ('the beer was good.', 'pos')
        words = tl[0].split()
        for word in words:  # 'the', 'beer', 'was', 'good.'
            if word not in stop:
                new_list.append(word)
    print(new_list)
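As a follow-up to the two-loop version above, here is a hedged sketch that keeps the pos/neg label paired with each cleaned sentence instead of returning one flat word list (a small hand-rolled stop set again stands in for set(stopwords.words('english'))):

```python
STOP = {'the', 'was', 'is', 'a', 'of', 'i', 'do', 'not', 'my'}  # stand-in stop set

def remove_stopwords(train_list):
    cleaned = []
    for sentence, label in train_list:
        # keep only the non-stopwords, then rebuild the sentence
        kept = [w for w in sentence.split() if w not in STOP]
        cleaned.append((' '.join(kept), label))
    return cleaned

pairs = [('the beer was good.', 'pos'), ('i do not enjoy my job', 'neg')]
print(remove_stopwords(pairs))  # [('beer good.', 'pos'), ('enjoy job', 'neg')]
```

Returning the list (rather than printing it) also lets the caller reuse the cleaned pairs, which addresses the comment below about the return value being ignored.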

@Vardan's point is absolutely right: there must be two loops, one over the tuples and another over the actual sentence. However, instead of working on the raw data word by word, we can tokenize each string and check the tokens against the stopwords.

The code below should work as expected:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import pandas as pd

df = pd.read_csv('test.csv', delimiter=',')
tlist = [tuple(x) for x in df.values]
tlist = [(x.lower(), y.lower()) for x, y in tlist]

def remove_stopwords(train_list):
    stop = set(stopwords.words('english'))   # build the stop set once
    new_list = []
    for word in train_list:
        total = ''                              # empty buffer string
        word_tokens = word_tokenize(word[0])    # tokenize the first string in the tuple
        for txt in word_tokens:
            if txt not in stop:                 # check each token against the stopwords
                total = total + ' ' + txt       # append the kept token to the buffer
        new_list.append((total, word[1]))       # keep the buffer with its pos/neg label
    print(new_list)

remove_stopwords(tlist)
print(tlist)
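One caveat worth noting with the earlier split() based version: punctuation stays glued to words ('good.' would never match the stopword 'the' even in principle, and the trailing '.' can never be handled separately), which is exactly why the answer above switches to word_tokenize. A rough sketch of that effect, using a simple regex tokenizer as a stand-in for NLTK's word_tokenize and a hand-rolled stop set:

```python
import re

STOP = {'the', 'was'}  # stand-in for the NLTK English stop list

def tokenize(text):
    # crude stand-in for nltk.tokenize.word_tokenize:
    # matches runs of word characters, and peels punctuation
    # off into separate tokens
    return re.findall(r"[\w']+|[.,!?;]", text)

def clean(sentence):
    return ' '.join(t for t in tokenize(sentence) if t not in STOP)

print(clean('the beer was good.'))  # 'beer good .'
# with plain split(), 'good.' would stay a single token and the '.'
# could never be separated from the word
```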

Comments:

Why is new_list global? Why is the return value of remove_stopwords ignored?

Use Python's index function to check whether an element exists in a list or tuple.

@FooBar Sorry, I copied in some code I was testing to check something. Updated accordingly.

What do you mean? Can you explain further? Please provide the desired output.

Unfortunately, all I get from this is "[]". Nothing more.

You must have gone wrong somewhere. Are your CSV loading and processing correct? Check len(train_list).

Ah, I was missing the final print call. I added it, but now I see that nothing is actually removed except the "pos" or "neg" in the second part of each row, which I need! I am trying to strip out stopwords, i.e. words like "and, a, the, to, it, my" and so on.

I added another solution.

I just tried it, and it works, but a little too well: it removes every occurrence of a stopword's characters. For example, since "a" is a stopword, any word containing the letter a now loses it, so "was" becomes "ws". Not only that, every letter and space now gets its own entry, and the parentheses and pos/neg indicators are gone, so the result looks like [['t'], ['h'], ['e'], ['b'], ['e'], ['e'], ['r'], ...

I changed the code; I am no longer splitting the sentence character by character. This should work.

Awesome, thank you! Happy coding :)