Python: How to drop rows whose column value is a subset of another value in the same column?


python python-3.x pandas


Suppose I have a dataframe df as follows:

index  company  url                    address
0      A        www.abc.contact.com    16D Bayberry Rd, New Bedford, MA, 02740, USA
1      A        www.abc.contact.com    MA, USA
2      A        www.abc.about.com      USA
3      B        www.pqr.com            New Bedford, MA, USA
4      B        www.pqr.com/about      MA, USA

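I want to remove from the dataframe every row whose address is a subset of another address belonging to the same company. Of the five rows above, I want to keep these two: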

index  company  url                    address
0      A        www.abc.contact.com    16D Bayberry Rd, New Bedford, MA, 02740, USA
3      B        www.pqr.com            New Bedford, MA, USA

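This may not be an optimal solution, but it does the job on this small dataframe:

Edit: added a check on the company name, assuming punctuation has already been removed.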

import pandas as pd

df = pd.DataFrame({"company": ['A', 'A', 'A', 'B', 'B'],
                   "address": ['16D Bayberry Rd, New Bedford, MA, 02740, USA',
                               'MA, USA',
                               'USA',
                               'New Bedford, MA, USA',
                               'MA, USA']})
# Split each address on ", " and turn it into a set so that "issubset" can be used later
addresses = [set(address.split(', ')) for address in df['address']]
companies = list(df['company'])

rows_to_drop = []  # Positions of the rows to drop
# Iterate over every (address, company) pair
for i, (address, company) in enumerate(zip(addresses, companies)):
    # All other addresses and companies (everything except row i)
    rem_addr = addresses[:i] + addresses[(i + 1):]
    rem_comp = companies[:i] + companies[(i + 1):]

    for other_addr, other_comp in zip(rem_addr, rem_comp):
        # Drop the row if its address is a subset of another address of the same company
        if address.issubset(other_addr) and company == other_comp:
            rows_to_drop.append(i)
            break

# rows_to_drop holds positions, so map them to index labels before dropping
df = df.drop(df.index[rows_to_drop])
print(df)

company address
0   A   16D Bayberry Rd, New Bedford, MA, 02740, USA
3   B   New Bedford, MA, USA
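
The loop above compares every row with every other row in the whole frame, and two identical addresses within one company would each count as a subset of the other, so both would be dropped. A minimal sketch of a per-company variant follows. The function name `drop_subset_addresses` and its parameters are hypothetical, and it assumes a unique index and that exact duplicates should keep their first occurrence (the question does not say how duplicates should be treated).

```python
import pandas as pd


def drop_subset_addresses(df, group_col="company", addr_col="address"):
    """Drop rows whose comma-separated address parts are a proper subset of
    another address in the same group. Exact duplicates keep the first row.
    Assumes the index labels of df are unique."""
    # One frozenset of stripped, comma-separated parts per row
    tokens = df[addr_col].apply(
        lambda s: frozenset(part.strip() for part in s.split(","))
    )
    keep = pd.Series(True, index=df.index)

    # .groups maps each company to the index labels of its rows
    for labels in df.groupby(group_col).groups.values():
        seen = set()  # address sets already kept for this company
        for label in labels:
            current = tokens.loc[label]
            # "<" on frozensets tests for a proper (strict) subset
            is_proper_subset = any(
                current < tokens.loc[other] for other in labels if other != label
            )
            if is_proper_subset or current in seen:
                keep.loc[label] = False
            else:
                seen.add(current)
    return df[keep]


df = pd.DataFrame({
    "company": ["A", "A", "A", "B", "B"],
    "url": ["www.abc.contact.com", "www.abc.contact.com", "www.abc.about.com",
            "www.pqr.com", "www.pqr.com/about"],
    "address": ["16D Bayberry Rd, New Bedford, MA, 02740, USA",
                "MA, USA",
                "USA",
                "New Bedford, MA, USA",
                "MA, USA"],
})

result = drop_subset_addresses(df)
print(result.index.tolist())  # [0, 3]
```

Because the rows are filtered with a boolean mask rather than rebuilt, the `url` column stays aligned with each surviving address. The comparison is still quadratic within each company, so very large groups may need a smarter strategy.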

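Comments:

What defines a subset? The string 'MA, USA' is not a substring of anything in company 'A'. The first row does contain both words separately, but do you want every part of the address split on commas and checked individually?

@ALollz by subset, I mean that after removing punctuation we should keep the address string that contains all the other listed addresses (like string subset matching).

@Harry_pb this is not a trivial task. It could be time-consuming to run, because you have to remove the punctuation, then split the string, then check whether all of its substrings are present in the company's address column, and repeat that for every row. That's crazy! Can you simplify it?

This solution does not take the company into account. That is why you managed to drop 'MA, USA' in company A even though it is not (literally) a substring of any address there. In this case a more sophisticated search is needed (e.g. based on splitting).

@teoretic thanks for your reply. The problem I am facing is not only getting a unique list of addresses, but also keeping the company and url aligned with each address, because I have very large datasets in which addresses are also repeated for other companies.

I edited my answer, hope this helps!

Thanks, after heavy modification it solved my problem. Cheers!

Glad I could help! Thanks for your answer!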
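
The distinction raised in the comments, a literal substring versus a subset of address parts, can be illustrated with a short sketch. The helper `tokenize` is hypothetical; it assumes that "removing punctuation" means replacing every non-word character with a space and comparing lower-cased words, which is only one possible reading of the asker's description.

```python
import re


def tokenize(address):
    """Lower-case the address, replace punctuation with spaces,
    and return the set of remaining words."""
    return set(re.sub(r"[^\w\s]", " ", address.lower()).split())


long_addr = "16D Bayberry Rd, New Bedford, MA, 02740, USA"
short_addr = "MA, USA"

# Literal substring test: fails, because "02740" sits between "MA" and "USA"
print(short_addr in long_addr)  # False

# Word-level subset test: succeeds, because both words occur in the long address
print(tokenize(short_addr) <= tokenize(long_addr))  # True
```

Note that word-level matching is looser than splitting on commas: "York, USA" would count as a subset of "New York, USA", which may or may not be the intended behaviour.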