Python: how to check whether a tuple contains a word and, if so, remove it

I am importing a CSV file via Pandas, in the following format:

test = [
    ('the beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]
I want to check whether any of the words in my stop list appear in the test set defined above and, if so, remove them. However, when I try, I just get back the full list without any modification. Here is my current code:

import pandas as pd
from nltk.corpus import stopwords

df = pd.read_csv('test.csv', delimiter=',')
tlist = [tuple(x) for x in df.values]
tlist = [(x.lower(), y.lower()) for x, y in tlist]

def remove_stopwords(train_list):
    new_list = []
    for word in train_list:
        if word not in stopwords.words('english'):
            new_list.append(word)
    print(new_list)

remove_stopwords(tlist)
I am trying to use the stopwords provided by the NLTK corpus. As I said, when I test this code with print(new_list), all that happens is that I get tlist back exactly as it went in.
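A minimal sketch of why the check never matches (a tiny hand-rolled stop set stands in for stopwords.words('english')): each element of tlist is a whole (sentence, label) tuple, and a tuple never equals a single stopword string, so the membership test is always False and every tuple survives.

```python
# Each element of tlist is a (sentence, label) tuple; comparing a whole
# tuple against stopword strings can never match, so nothing is filtered.
stop = {'the', 'was', 'my'}  # stand-in for stopwords.words('english')
tlist = [('the beer was good.', 'pos'), ('i do not enjoy my job', 'neg')]

new_list = [item for item in tlist if item not in stop]
print(new_list)  # identical to tlist: no tuple equals any stopword
```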

The word in your for loop is actually a tuple, because tlist has the form [(a1, b1), (a2, b2)], a list of tuples. So you are comparing each whole tuple against a single word from the stopwords. If you add a print, you will see this:

def remove_stopwords(train_list):
    new_list = []
    for word in train_list:
        print(word)  # prints a whole tuple, e.g. ('the beer was good.', 'pos')
        if word not in stopwords.words('english'):
            new_list.append(word)
    print(new_list)
If you want to remove the words, you need at least two loops: one to iterate over the list and another to iterate over the words. Something like this would work:

def remove_stopwords(train_list):
    stop = set(stopwords.words('english'))
    new_list = []
    for tl in train_list:
        # tl is a tuple such as ('the beer was good.', 'pos')
        words = tl[0].split()
        for word in words:  # 'the', 'beer', 'was', 'good.'
            if word not in stop:
                new_list.append(word)
    print(new_list)
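As a follow-up to the two-loop version above, here is a hedged sketch that keeps the pos/neg label paired with each cleaned sentence instead of returning one flat word list (a small hand-rolled stop set again stands in for set(stopwords.words('english'))):

```python
STOP = {'the', 'was', 'is', 'a', 'of', 'i', 'do', 'not', 'my'}  # stand-in stop set

def remove_stopwords(train_list):
    cleaned = []
    for sentence, label in train_list:
        # keep only the non-stopwords, then rebuild the sentence
        kept = [w for w in sentence.split() if w not in STOP]
        cleaned.append((' '.join(kept), label))
    return cleaned

pairs = [('the beer was good.', 'pos'), ('i do not enjoy my job', 'neg')]
print(remove_stopwords(pairs))  # [('beer good.', 'pos'), ('enjoy job', 'neg')]
```

Returning the list (rather than printing it) also lets the caller reuse the cleaned pairs, which addresses the comment below about the return value being ignored.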

@Vardan's point is absolutely right: there must be two loops, one over the tuples and another over the actual sentence. However, instead of working on the raw data word by word, we can tokenize each string and check the tokens against the stopwords.

The code below should work as expected:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import pandas as pd

df = pd.read_csv('test.csv', delimiter=',')
tlist = [tuple(x) for x in df.values]
tlist = [(x.lower(), y.lower()) for x, y in tlist]

def remove_stopwords(train_list):
    stop = set(stopwords.words('english'))   # build the stop set once
    new_list = []
    for word in train_list:
        total = ''                              # empty buffer string
        word_tokens = word_tokenize(word[0])    # tokenize the first string in the tuple
        for txt in word_tokens:
            if txt not in stop:                 # check each token against the stopwords
                total = total + ' ' + txt       # append the kept token to the buffer
        new_list.append((total, word[1]))       # keep the buffer with its pos/neg label
    print(new_list)

remove_stopwords(tlist)
print(tlist)
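One caveat worth noting with the earlier split() based version: punctuation stays glued to words ('good.' would never match the stopword 'the' even in principle, and the trailing '.' can never be handled separately), which is exactly why the answer above switches to word_tokenize. A rough sketch of that effect, using a simple regex tokenizer as a stand-in for NLTK's word_tokenize and a hand-rolled stop set:

```python
import re

STOP = {'the', 'was'}  # stand-in for the NLTK English stop list

def tokenize(text):
    # crude stand-in for nltk.tokenize.word_tokenize:
    # matches runs of word characters, and peels punctuation
    # off into separate tokens
    return re.findall(r"[\w']+|[.,!?;]", text)

def clean(sentence):
    return ' '.join(t for t in tokenize(sentence) if t not in STOP)

print(clean('the beer was good.'))  # 'beer good .'
# with plain split(), 'good.' would stay a single token and the '.'
# could never be separated from the word
```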

Comments:

Why is new_list global? Why is the return value of remove_stopwords ignored?

Use Python's index function to check whether an element exists in a list or tuple.

@FooBar Sorry, I copied in some code I was testing to check something. Updated accordingly.

What do you mean? Can you explain further? Please provide the desired output.

Unfortunately, all I get from this is "[]". Nothing more.

You must have gone wrong somewhere. Are your CSV loading and processing correct? Check len(train_list).

Ah, I was missing the final print call. I added it, but now I see that nothing is actually removed except the "pos" or "neg" in the second part of each row, which I need! I am trying to strip out stopwords, i.e. words like "and, a, the, to, it, my" and so on.

I added another solution.

I just tried it, and it works, but a little too well: it removes every occurrence of a stopword's characters. For example, since "a" is a stopword, any word containing the letter a now loses it, so "was" becomes "ws". Not only that, every letter and space now gets its own entry, and the parentheses and pos/neg indicators are gone, so the result looks like [['t'], ['h'], ['e'], ['b'], ['e'], ['e'], ['r'], ...

I changed the code; I am no longer splitting the sentence character by character. This should work.

Awesome, thank you! Happy coding :)