Python pandas: add an indicator for duplicates on columns


python pandas


Here is a table with columns A, B, C and D:

      A B  C    D   
    0 1 2 1.0   a   
    1 1 2 1.01  a   
    2 1 2 1.0   b   
    3 3 4 0     b   
    4 3 4 0     c   
    5 1 2 1     c   
    6 1 9 1     c

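How can I add a column that lists, for each row, the other rows that count as duplicates under these constraints:

A and B match exactly
C matches within a float tolerance (within 0.05)
D must not match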

This is far from pretty, but it gets the job done:

tolerance = 0.05

dups = {}
# Only rows with identical A and B can be duplicates, so compare within groups.
for _, group in df.groupby(['A', 'B']):
    for i, row1 in group.iterrows():

        data = []

        for j, row2 in group.iterrows():
            if i != j:
                # C must be within the tolerance, D must differ
                if abs(row1['C'] - row2['C']) <= tolerance:
                    if row1['D'] != row2['D']:
                        data.append(j)

        dups[i] = data

# Look the lists up by index label so the order matches df.
df['dups'] = [dups[i] for i in df.index]

df

    A   B   C       D   dups
0   1   2   1.00    a   [2, 5]
1   1   2   1.01    a   [2, 5]
2   1   2   1.00    b   [0, 1, 5]
3   3   4   0.00    b   [4]
4   3   4   0.00    c   [3]
5   1   2   1.00    c   [0, 1, 2]
6   1   9   1.00    c   []
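The same three constraints can also be written without explicit Python loops by joining the table to itself on A and B. The following is only a sketch and not part of the answer above; the helper column names `i`, `j`, `C_other` and `D_other` are made up for illustration, and it assumes the DataFrame has a default, unnamed index.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 3, 3, 1, 1],
    "B": [2, 2, 2, 4, 4, 2, 9],
    "C": [1.0, 1.01, 1.0, 0.0, 0.0, 1.0, 1.0],
    "D": ["a", "a", "b", "b", "c", "c", "c"],
})
tolerance = 0.05

# Keep the row label as an ordinary column so it survives the merge.
left = df.reset_index().rename(columns={"index": "i"})
right = left.rename(columns={"i": "j", "C": "C_other", "D": "D_other"})

# Pair every row with every row that has the same A and B.
pairs = left.merge(right, on=["A", "B"])

# Apply the remaining constraints to the candidate pairs.
mask = (
    (pairs["i"] != pairs["j"])                               # not the row itself
    & ((pairs["C"] - pairs["C_other"]).abs() <= tolerance)   # C within tolerance
    & (pairs["D"] != pairs["D_other"])                       # D must differ
)

# Collect the matching row labels per row, in ascending order.
matches = pairs[mask].sort_values(["i", "j"]).groupby("i")["j"].apply(list)

# Rows without any match get an empty list.
df["dups"] = [matches.get(i, []) for i in df.index]
print(df)
```

The merge still builds every pair inside each (A, B) group, so the amount of work is comparable to the nested loops; the difference is that the comparisons run as column operations.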

Convert the DataFrame to a dictionary:
res = df.T.to_dict("list")
res

{0: [1, 2, 1.0, 'a'],
 1: [1, 2, 1.01, 'a'],
 2: [1, 2, 1.0, 'b'],
 3: [3, 4, 0.0, 'b'],
 4: [3, 4, 0.0, 'c'],
 5: [1, 2, 1.0, 'c'],
 6: [1, 9, 1.0, 'c']}

Pair each index with its values in a single tuple:
box = [(key,*value) for key, value in res.items()]
box

[(0, 1, 2, 1.0, 'a'),
 (1, 1, 2, 1.01, 'a'),
 (2, 1, 2, 1.0, 'b'),
 (3, 3, 4, 0.0, 'b'),
 (4, 3, 4, 0.0, 'c'),
 (5, 1, 2, 1.0, 'c'),
 (6, 1, 9, 1.0, 'c')]

Apply your conditions to filter out the matches:
from itertools import permutations
phase1 = [(ind, (first, second),*_) for ind, first, second, *_ in box]

# can be refactored into something cleaner
phase2 = [((*first[1], *first[2:]), second[0])
          for first, second in permutations(phase1, 2)
          if first[1] == second[1]                  # same (A, B)
          and abs(second[2] - first[2]) <= 0.05     # C within tolerance, either direction
          and first[-1] != second[-1]               # different D
         ]
phase2

[((1, 2, 1.0, 'a'), 2),
 ((1, 2, 1.0, 'a'), 5),
 ((1, 2, 1.01, 'a'), 2),
 ((1, 2, 1.01, 'a'), 5),
 ((1, 2, 1.0, 'b'), 0),
 ((1, 2, 1.0, 'b'), 1),
 ((1, 2, 1.0, 'b'), 5),
 ((3, 4, 0.0, 'b'), 4),
 ((3, 4, 0.0, 'c'), 3),
 ((1, 2, 1.0, 'c'), 0),
 ((1, 2, 1.0, 'c'), 1),
 ((1, 2, 1.0, 'c'), 2)]
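The next step uses a dictionary named `d`, but the code that builds it is missing from this page. Judging from how `d.items()` is used, it most likely groups the matching indices in `phase2` by row tuple. A minimal sketch under that assumption (the use of `defaultdict` is a guess, not the author's confirmed code):

```python
from collections import defaultdict

# phase2 as printed above: (row values, index of a matching row)
phase2 = [((1, 2, 1.0, 'a'), 2), ((1, 2, 1.0, 'a'), 5),
          ((1, 2, 1.01, 'a'), 2), ((1, 2, 1.01, 'a'), 5),
          ((1, 2, 1.0, 'b'), 0), ((1, 2, 1.0, 'b'), 1), ((1, 2, 1.0, 'b'), 5),
          ((3, 4, 0.0, 'b'), 4), ((3, 4, 0.0, 'c'), 3),
          ((1, 2, 1.0, 'c'), 0), ((1, 2, 1.0, 'c'), 1), ((1, 2, 1.0, 'c'), 2)]

# Assumed reconstruction: collect all matching indices under each row tuple.
d = defaultdict(list)
for row, match_index in phase2:
    d[row].append(match_index)
```

Because the row values themselves are the key, two rows with identical A, B, C and D would share one entry.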

Join the values in d (a dictionary mapping each row tuple to its list of matching indices) into strings:
e = [(*k,",".join(str(ent) for ent in v)) for k,v in d.items()]
e

[(1, 2, 1.0, 'a', '2,5'),
 (1, 2, 1.01, 'a', '2,5'),
 (1, 2, 1.0, 'b', '0,1,5'),
 (3, 4, 0.0, 'b', '4'),
 (3, 4, 0.0, 'c', '3'),
 (1, 2, 1.0, 'c', '0,1,2')]

Create a DataFrame from the extracted tuples:
cols = df.columns.append(pd.Index(["Dups"]))
dups = pd.DataFrame(e, columns=cols)

Merge with the original DataFrame:
result = df.merge(dups, how="left", on=["A", "B", "C", "D"])
result

    A   B   C       D   Dups
0   1   2   1.00    a   2,5
1   1   2   1.01    a   2,5
2   1   2   1.00    b   0,1,5
3   3   4   0.00    b   4
4   3   4   0.00    c   3
5   1   2   1.00    c   0,1,2
6   1   9   1.00    c   NaN

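My original answer requires N**2 iterations for N rows. sammywemmy's answer loops over permutations(…, 2), which is essentially a loop over N*(N-1) combinations. warped's answer is more efficient because it first does a faster match on columns A and B, but the search for the conditions on columns C and D is still slow. Its iteration count is therefore N*M, where M is the average number of rows sharing the same A and B values.

If you are willing to change the requirement "C equal +/- 0.05" to "C equal when rounded to 1 decimal place", you can do better with N*K iterations, where K is the average number of rows with the same A, B and C values. Here is one implementation; you could also adapt warped's approach.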
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'A': {0: 1, 1: 1, 2: 1, 3: 3, 4: 3, 5: 1, 6: 1},
     'B': {0: 2, 1: 2, 2: 2, 3: 4, 4: 4, 5: 2, 6: 9},
     'C': {0: 1.0, 1: 1.01, 2: 1.0, 3: 0.0, 4: 0.0, 5: 1.0, 6: 1.0},
     'D': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c', 5: 'c', 6: 'c'}})

# alternative to "equal +/- 0.05"
df['C10'] = np.around(df['C']*10).astype('int')

# convert int64 tuples to int tuples
ituple = lambda tup: tuple(int(x) for x in tup)

# records: [(1, 2, 10), (1, 2, 10), (1, 2, 10), (3, 4, 0), ...]
records = [ituple(rec) for rec in df[['A', 'B', 'C10']].to_records(index=False)]

# dupd: dict with records as key, list of indices as values.
# e.g. {(1, 2, 10): [0, 1, 2, 5], ...}
dupd = {} # key: ABC tuples; value: list of indices

# Build up dupd based on equal A, B, C columns.
for i, rec in enumerate(records):
    # each record is a tuple with integers; can be used as key in dict
    if rec in dupd:
        dupd[rec].append(i)
    else:
        dupd[rec] = [i]
        
# build duplicates for each row, remove the ones with equal D
dups = []
D = df['D']
for i, rec in enumerate(records):
    dup = [j for j in dupd[rec] if i!=j and D[i] != D[j]]
    dups.append(tuple(dup))
    
df.drop(columns=['C10'], inplace=True)
df['Dups'] = dups
        
print(df)

Output:
   A  B     C  D       Dups
0  1  2  1.00  a     (2, 5)
1  1  2  1.01  a     (2, 5)
2  1  2  1.00  b  (0, 1, 5)
3  3  4  0.00  b       (4,)
4  3  4  0.00  c       (3,)
5  1  2  1.00  c  (0, 1, 2)
6  1  9  1.00  c         ()
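Rounding C into buckets is not the same test as the +/- 0.05 tolerance, which is why it was described as a change to the requirement. The sketch below uses made-up values of C (not taken from the question) and two hypothetical helper functions to show both ways the results can differ.

```python
import numpy as np

tolerance = 0.05

def same_bucket(x, y):
    """True if x and y get the same key after rounding to 1 decimal place."""
    return int(np.around(x * 10)) == int(np.around(y * 10))

def within_tolerance(x, y):
    """True if x and y satisfy the original +/- tolerance requirement."""
    return abs(x - y) <= tolerance

# Close values that fall on opposite sides of a bucket boundary.
near = (1.04, 1.06)
print(within_tolerance(*near), same_bucket(*near))

# Values further apart than the tolerance that still share a bucket.
far = (0.96, 1.04)
print(within_tolerance(*far), same_bucket(*far))
```

The first pair is within the tolerance but lands in different buckets, and the second pair shares a bucket but exceeds the tolerance, so the bucketed version only suits data where such borderline cases do not matter.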

Here is the original answer, which scales as O(N**2) but is easy to understand:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'A': {0: 1, 1: 1, 2: 1, 3: 3, 4: 3, 5: 1, 6: 1},
     'B': {0: 2, 1: 2, 2: 2, 3: 4, 4: 4, 5: 2, 6: 9},
     'C': {0: 1.0, 1: 1.01, 2: 1.0, 3: 0.0, 4: 0.0, 5: 1.0, 6: 1.0},
     'D': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c', 5: 'c', 6: 'c'}})


dups = []
for i, irow in df.iterrows():
    dup = []
    for j, jrow in df.iterrows():
        if (i != j and 
            irow['A'] == jrow['A'] and
            irow['B'] == jrow['B'] and 
            abs(irow['C']-jrow['C']) < 0.05 and
            irow['D'] != jrow['D']
            ):
            dup.append(j)
    dups.append(tuple(dup))
df['Dups'] = dups

print(df)

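Output:
   A  B     C  D       Dups
0  1  2  1.00  a     (2, 5)
1  1  2  1.01  a     (2, 5)
2  1  2  1.00  b  (0, 1, 5)
3  3  4  0.00  b       (4,)
4  3  4  0.00  c       (3,)
5  1  2  1.00  c  (0, 1, 2)
6  1  9  1.00  c         ()

Comment from the asker: I don't mind the inefficiency. I just want to add a column to help with some research. The data is generated a few times a week and may take a while to generate. I'm not very familiar with pandas. Thanks!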
