Removing custom words from a list in Python

python list function

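I am writing a function that performs custom word removal, stemming (reducing words to their root form), and then tf-idf.

The input to my function is a list. Removing the custom words works when I try it on a single list, but once I put it inside the function I get an attribute error:

AttributeError: 'list' object has no attribute 'lower'

Here is my code: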

def tfidf_kw(K):    
    # Select docs in cluster K
    docs = np.array(mydata2)[km_r3.labels_==K]

    ps= PorterStemmer()
    stem_docs = []
    for doc in docs:
        keep_tokens = []
        
        for token in doc.split(' '):
            #custom stopword removal
            my_list = ['model', 'models', 'modeling', 'modelling', 'python', 
           'train','training', 'trains', 'trained','test','testing', 'tests','tested']
            
            token  = [sub_token for sub_token in list(doc) if sub_token not in my_list]

            stem_token=ps.stem(token)
            keep_tokens.append(stem_token)

        keep_tokens =' '.join(keep_tokens)
        stem_docs.append(keep_tokens)

        return(keep_tokens)

The rest of the code is the tf-idf part, which works. This is the line where I need help understanding what I am doing wrong:

token  = [sub_token for sub_token in list(doc) if sub_token not in my_list]

Here is the full error:

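AttributeError                            Traceback (most recent call last)
<ipython-input-...> in <module>
     49 #return(sorted_df)
     50 
---> 51 tfidf_kw(0)

<ipython-input-...> in tfidf_kw(K)
     20 
     21 
---> 22             stem_token=ps.stem(token)
     23             keep_tokens.append(stem_token)
     24 

~/opt/anaconda3/lib/python3.8/site-packages/nltk/stem/porter.py in stem(self, word)
    650 
    651     def stem(self, word):
--> 652         stem = word.lower()
    653 
    654         if self.mode == self.NLTK_EXTENSIONS and word in self.pool:

AttributeError: 'list' object has no attribute 'lower'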

Line 51 of the traceback is tfidf_kw(0), which is where I test the function with K=0.

The ps.stem method expects a single word (a string) as its argument, but you are passing it a list of strings.
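To see where the list comes from, here is a minimal sketch that reproduces the problem without NLTK. The sample string doc is made up for illustration, and my_list is shortened from the question.

```python
# Hypothetical sample document; my_list is shortened from the question.
doc = "training the model"
my_list = ['model', 'training']

# list(doc) splits the string into single characters, not words,
# so no element ever matches my_list and the result is a list.
token = [sub_token for sub_token in list(doc) if sub_token not in my_list]
print(type(token).__name__)   # list
print(token[:3])              # ['t', 'r', 'a']

# Calling a string method on that list fails the same way ps.stem does.
error_message = None
try:
    token.lower()
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

So the comprehension replaces the word in token with a list of characters, and ps.stem then fails as soon as it calls word.lower().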

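Since you are already inside the for token in doc.split(' ') loop, the additional list comprehension [... for sub_token in list(doc)] does not seem to make sense to me.

If your goal is to skip the tokens that appear in my_list, you probably want to write the for token in doc.split(' ') loop like this:
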
for token in doc.split(' '):
    my_list = ['model', 'models', 'modeling', 'modelling', 'python', 
   'train','training', 'trains', 'trained','test','testing', 'tests','tested']

    if token in my_list:
        continue
    
    stem_token=ps.stem(token)
    keep_tokens.append(stem_token)


Here, if token is one of the words in my_list, the continue statement skips the rest of the current iteration and the loop moves on to the next token.
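
That works great. However, for K=0 I expect 124 rows, but when I check the keep_tokens variable I see only 1 row, i.e. the tokenization above is applied to only 1 row.

I noticed that the return line is inside the loop. You should probably move it out of the loop, which would explain this behavior. You should probably also return stem_docs rather than keep_tokens.

That was it!!! Thank you so much!!! :)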
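
Putting the suggestions from this thread together, the stemming part of the function could look roughly like the sketch below. The name stem_documents and its parameters are hypothetical, and the stemmer is passed in as a plain callable so the example runs without NLTK; toy_stem is only a stand-in, not a real stemmer.

```python
def stem_documents(docs, stopwords, stem):
    """Drop stopwords from each document and stem the remaining tokens.

    `stem` is any callable that maps one word (a str) to its stem.
    All names here are illustrative, not taken from the question.
    """
    stopwords = set(stopwords)      # built once, outside the loops
    stem_docs = []
    for doc in docs:
        keep_tokens = []
        # split() with no argument also collapses repeated whitespace;
        # the question used split(' ').
        for token in doc.split():
            if token in stopwords:
                continue            # skip custom stopwords
            keep_tokens.append(stem(token))
        stem_docs.append(' '.join(keep_tokens))
    # Return after the outer loop, so every document is processed.
    return stem_docs


# Toy stand-in for a real stemmer (illustration only).
def toy_stem(word):
    return word.lower().rstrip('s')


docs = ["python models train fast", "Tests run"]
result = stem_documents(docs, ['python', 'models', 'train'], toy_stem)
print(result)   # ['fast', 'test run']
```

With NLTK installed, passing PorterStemmer().stem as the stem argument should give the stemming from the question, with one output string per input document.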