How to remove common words from a list of lists in Python?


python list loops


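I have many "groups" of words. If any word from a group appears in both column A and column B, I want to remove that group's words from both columns. How do I loop over all the groups (i.e. over the sublists in the list)?

The flawed code below removes the common words only for the last group, not for all three groups (lists) in stuff. [I first create an indicator for whether a word from the group is in the string, then another indicator for whether both strings contain a word from the group. Only for pairs of A and B that both contain a word from the group do I remove that group's words.]

How do I specify the loop correctly?

Edit: in my proposed code, each pass of the loop restarted from the original columns instead of looping over the columns with the words from earlier groups already removed.

The suggested solutions are more elegant and concise, but they also remove a word when it is part of another word (for example, the word "foo" is correctly removed from "foo hello", but it is also incorrectly removed from "foobar").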
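The "foo" versus "foobar" problem from the edit can be reproduced in a few lines. This is only a minimal sketch using the standard library; the helper names remove_substring and remove_whole_word are made up for illustration and do not come from any of the answers.

```python
import re

def remove_substring(text, word):
    # Plain str.replace matches anywhere, including inside longer words.
    return " ".join(text.replace(word, " ").split())

def remove_whole_word(text, word):
    # \b anchors the match at word boundaries, so "foobar" is left alone.
    pattern = r"\b" + re.escape(word) + r"\b"
    return " ".join(re.sub(pattern, " ", text).split())

print(remove_substring("foo hello", "foo"))   # hello
print(remove_substring("foobar", "foo"))      # bar
print(remove_whole_word("foo hello", "foo"))  # hello
print(remove_whole_word("foobar", "foo"))     # foobar
```

re.escape is used so that a word containing regex metacharacters is still matched literally.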


# Input data:

import pandas as pd

data = {'A': ['summer time third grey abc', 'yellow sky hello table', 'fourth autumnwind'],
        'B': ['defg autumn times fourth table', 'not red skies second garnet', 'first blue chair winter']
}
df = pd.DataFrame(data, columns=['A', 'B'])

                            A                               B
0  summer time third grey abc  defg autumn times fourth table
1      yellow sky hello table     not red skies second garnet
2           fourth autumnwind         first blue chair winter

The expected output is:

         A_new              B_new
0     grey abc         defg table
1  hello table  not second garnet
2   autumnwind  blue chair winter

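This needs Python 3.7+ to work, because it relies on dicts preserving insertion order (otherwise more code is needed). Judging by your keyword lists, I assume you are trying to give multi-word matches priority.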

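# Assumes `df` and `stuff` are defined as in the question.
dummy = 0  # running counter that keeps the dict keys below unique

def splitter(text):
    # Split text into ((id, piece), group_index) pairs; group_index is -1
    # for a piece that matches no keyword. Matching is by substring.
    global dummy
    text = text.strip()
    if not text:
        return []
    dummy += 1
    uid = dummy  # one id per call, so repeated words do not collide as keys
    for n, s in enumerate(stuff):
        for keyword in s:
            p = text.find(keyword)
            if p >= 0:
                # Recurse on the text before and after the first hit.
                return splitter(text[:p]) + [((uid, keyword), n)] + splitter(text[p + len(keyword):])
    # No keyword found: keep the text unchanged.
    return [((uid, text), -1)]

def remover(row):
    A = dict(splitter(row['A']))
    B = dict(splitter(row['B']))
    # Group indices that occur in both columns of this row.
    s = set(A.values()).intersection(B.values())
    return [' '.join(k[1] for k, v in A.items() if v < 0 or v not in s),
            ' '.join(k[1] for k, v in B.items() if v < 0 or v not in s)]

pd.concat([df, pd.DataFrame(df.apply(remover, axis=1).to_list(), columns=['newA', 'newB'])], axis=1)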
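The answer above leans on two behaviours that are easy to check in isolation: keywords are tried in list order, so a phrase listed before its shorter variant wins, and a dict returns its keys in insertion order from Python 3.7 on. The following is a small sketch of both; first_match is a hypothetical helper written for this illustration, not part of the answer.

```python
def first_match(text, keywords):
    # Return the first keyword (in list order) that occurs in text, else None.
    for keyword in keywords:
        if keyword in text:
            return keyword
    return None

seasons = ['summer times', 'summer time', 'summer']
print(first_match('summer time third grey abc', seasons))  # summer time

# From Python 3.7 on, a dict keeps its keys in insertion order, which is
# what lets a phrase be rebuilt from dict keys in the original word order.
pieces = dict([((0, 'grey'), 0), ((1, 'abc'), -1)])
print(' '.join(key[1] for key in pieces))  # grey abc
```

If 'summer' were listed first, first_match would return 'summer' and the trailing 'time' would be left behind in the text.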

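import re

# Flatten a list of lists into a single list.
flatten_list = lambda l: [item for sublist in l for item in sublist]

def remove_recursive(s, l):
    # Remove every word of l from s by plain substring replacement,
    # then collapse repeated spaces.
    while len(l) > 0:
        s = s.replace(l[0], '')
        l = l[1:]
    return re.sub(r' +', ' ', s).strip()

# Only groups with at least one word in both x.A and x.B are removed.
df['A_new'] = df.apply(lambda x: remove_recursive(x.A, flatten_list([l for l in stuff if len([e for e in l if e in x.A]) > 0 and len([e for e in l if e in x.B]) > 0])), axis=1)
df['B_new'] = df.apply(lambda x: remove_recursive(x.B, flatten_list([l for l in stuff if len([e for e in l if e in x.A]) > 0 and len([e for e in l if e in x.B]) > 0])), axis=1)
df.head()
#            A_new              B_new
# 0  time grey abc         defg table
# 1          table  not second garnet
# 2           wind         blue chair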

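This is similar to the code in the comment: it uses a recursive lambda to match the words and a flattened list to collect the words of the lists that match in both columns.

Below is the code from the original question, using the regex r'\b{}\b' and corrected to loop over the latest strings rather than the original ones.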
# Groups of words to be removed:

colors = ['red skies', 'red sky', 'yellow sky', 'yellow skies', 'red', 'blue', 'black', 'yellow', 'green', 'grey']
seasons = ['summer times', 'summer time', 'autumn times', 'autumn time', 'spring', 'summer', 'winter', 'autumn']
numbers = ['first', 'second', 'third', 'fourth']

stuff = [colors, seasons, numbers]


import re

# Start from copies of the original columns; every group pass updates these.
df['A_new'] = df['A']
df['B_new'] = df['B']


def f_indicator(S,y):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', S):
            y = 1
    return y


def fRemove(S):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', S):
            S=re.sub(r'\b{}\b'.format(re.escape(word)), ' ', S)
    return S


for listed in stuff:

    df['A_Ind'] = 0
    df['B_Ind'] = 0

    df['A_Ind'] = df.apply(lambda x: f_indicator(x.A_new, x.A_Ind), axis=1)
    df['B_Ind'] = df.apply(lambda x: f_indicator(x.B_new, x.B_Ind), axis=1)

    df['inboth'] = 0
    df.loc[((df.A_Ind == 1) & (df.B_Ind == 1)), 'inboth'] = 1



    df.loc[df.inboth == 1, 'A_new'] = df.apply(lambda x: fRemove(x.A_new), axis=1)
    df.loc[df.inboth == 1, 'B_new'] = df.apply(lambda x: fRemove(x.B_new), axis=1)


    del df['inboth']
    del df['A_Ind']
    del df['B_Ind']

    
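    # Collapse runs of whitespace left behind by the removals.
    # regex=True is needed on pandas 2.0+, where str.replace is literal by default.
    df['A_new'] = df['A_new'].str.replace(r'\s{2,}', ' ', regex=True)
    df['A_new'] = df['A_new'].str.strip()
    df['B_new'] = df['B_new'].str.replace(r'\s{2,}', ' ', regex=True)
    df['B_new'] = df['B_new'].str.strip()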

del df['A']
del df['B']
print(df)


Output:
         A_new              B_new
0     grey abc         defg table
1  hello table  not second garnet
2   autumnwind  blue chair winter
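The same row-wise rule can also be written without indicator columns. The following is only a sketch, not the code from any answer: compile_group and remove_shared_groups are hypothetical helper names, and it works on plain strings so that it runs without pandas. Each group becomes one compiled pattern with \b on both sides, and a group is removed from a pair of strings only when it matches in both.

```python
import re

colors = ['red skies', 'red sky', 'yellow sky', 'yellow skies', 'red', 'blue', 'black', 'yellow', 'green', 'grey']
seasons = ['summer times', 'summer time', 'autumn times', 'autumn time', 'spring', 'summer', 'winter', 'autumn']
numbers = ['first', 'second', 'third', 'fourth']
stuff = [colors, seasons, numbers]

def compile_group(words):
    # Longest phrases first, so "summer times" is tried before "summer".
    ordered = sorted(words, key=len, reverse=True)
    return re.compile(r'\b(?:' + '|'.join(map(re.escape, ordered)) + r')\b')

def remove_shared_groups(a, b, patterns):
    # Each pass works on the strings left over from the previous group.
    for pattern in patterns:
        if pattern.search(a) and pattern.search(b):
            a = pattern.sub(' ', a)
            b = pattern.sub(' ', b)
    # Collapse the whitespace left behind by the removals.
    return ' '.join(a.split()), ' '.join(b.split())

patterns = [compile_group(group) for group in stuff]
rows = [('summer time third grey abc', 'defg autumn times fourth table'),
        ('yellow sky hello table', 'not red skies second garnet'),
        ('fourth autumnwind', 'first blue chair winter')]
for a, b in rows:
    print(remove_shared_groups(a, b, patterns))
```

With the DataFrame from the question, the result could be collected with pd.DataFrame([remove_shared_groups(a, b, patterns) for a, b in zip(df['A'], df['B'])], columns=['A_new', 'B_new']).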

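I think you meant to say "if any word from any group appears in column A or column B", because with your current wording your expected output is wrong. "Summer" appears only in column A, yet you still remove it... Honestly, I would do something similar but with a list comprehension: [word for word in x if word not in flatten([l for l in stuff if (len([e for e in l if e in x.A]) > 0 and len([e for e in l if e in x.B]) > 0)])], where flatten is a simple flatten operator.

No, the expected output is correct. In the first row, both A (summer time third grey abc) and B (defg autumn times fourth table) contain words from the seasons group. So, because both A and B contain a season word, "summer" should be removed from A and "autumn" should be removed from B. Likewise, the number words "third" and "fourth" should be removed from A and B in the first row. Compare this with the third row, where "blue" is not removed from B because column A contains no color word. That is also why I think flattening the lists does not work: flattening would remove the words everywhere. Instead, I need to remove a group's words only when words from that same group appear in both A and B of the same row.

Can I assume your words are separated by single spaces? For this, using string functions is much easier than using re.

@BingWang Thanks. Unfortunately, my lists contain compound words, so the solution does not actually work. I apologize for the misinformation that the words in the groups are separated by single spaces. I will edit the question to clarify.

@pandini I updated the answer to use recursion. The variable dummy is used to create unique keys in case a word is repeated within a single phrase.

The list comprehension is very nice and would indeed shorten the code.

@BingWang If you compute the common part of the comprehension first, you can also optimize it and cut the search time in half.

I think I just found another problem: a word that is part of a longer word gets removed incorrectly. (That is why the regex with \b was needed in the first place. I will edit the question with an example.)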