Python pandas: add an indicator for duplicates on columns

Tags: python, pandas, duplicates, apply

Here is a table with columns A, B, C, D:

      A B  C    D   
    0 1 2 1.0   a   
    1 1 2 1.01  a   
    2 1 2 1.0   b   
    3 3 4 0     b   
    4 3 4 0     c   
    5 1 2 1     c   
    6 1 9 1     c   
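How can I add a column that shows, for each row, the duplicates among the other rows, subject to these constraints:

  • exact match on A and B
  • float tolerance on C (within 0.05)
  • D must not match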

This is far from pretty, but it does get the job done:

tolerance=0.05

dups={}
for _, group in df.groupby(['A', 'B']):
    for i, row1 in group.iterrows():
        
        data = []
        
        for j, row2 in group.iterrows():
            if i!=j:
                if abs(row1['C'] - row2['C']) <= tolerance:
                    if row1['D'] != row2['D']:
                        print(i,j)
                        data.append(j)
        
        dups[i] = data
        
dups = [dups.get(a) for a in range(len(dups.keys()))]
df['dups'] = dups

df

    A   B   C       D   dups
0   1   2   1.00    a   [2, 5]
1   1   2   1.01    a   [2, 5]
2   1   2   1.00    b   [0, 1, 5]
3   3   4   0.00    b   [4]
4   3   4   0.00    c   [3]
5   1   2   1.00    c   [0, 1, 2]
6   1   9   1.00    c   []
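
The same three constraints can also be expressed without explicit row loops. The following is only a sketch of one alternative, not the method used in the answer above: it assumes a self-merge on A and B followed by a boolean filter, and the names `indexed`, `pairs`, `keep`, `matches` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 3, 3, 1, 1],
    'B': [2, 2, 2, 4, 4, 2, 9],
    'C': [1.0, 1.01, 1.0, 0.0, 0.0, 1.0, 1.0],
    'D': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
})
tolerance = 0.05

# Expose the row label as a regular column named "index"
indexed = df.reset_index()

# Pair every row with every row that has the same A and B
pairs = indexed.merge(indexed, on=['A', 'B'], suffixes=('_l', '_r'))

# Keep pairs of different rows whose C is within tolerance and whose D differs
keep = (
    (pairs['index_l'] != pairs['index_r'])
    & ((pairs['C_l'] - pairs['C_r']).abs() <= tolerance)
    & (pairs['D_l'] != pairs['D_r'])
)
matches = pairs[keep].groupby('index_l')['index_r'].apply(list)

# Rows without any match get an empty list
result = df.copy()
result['dups'] = [sorted(matches.get(i, [])) for i in df.index]
print(result)
```

The merge still materialises every pair within an (A, B) group, so memory use grows with the square of the group size.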
Convert the DataFrame to a dictionary:

res = df.T.to_dict("list")
res

{0: [1, 2, 1.0, 'a'],
 1: [1, 2, 1.01, 'a'],
 2: [1, 2, 1.0, 'b'],
 3: [3, 4, 0.0, 'b'],
 4: [3, 4, 0.0, 'c'],
 5: [1, 2, 1.0, 'c'],
 6: [1, 9, 1.0, 'c']}
Pair each index with the values in its sublist:

box = [(key,*value) for key, value in res.items()]
box

[(0, 1, 2, 1.0, 'a'),
 (1, 1, 2, 1.01, 'a'),
 (2, 1, 2, 1.0, 'b'),
 (3, 3, 4, 0.0, 'b'),
 (4, 3, 4, 0.0, 'c'),
 (5, 1, 2, 1.0, 'c'),
 (6, 1, 9, 1.0, 'c')]
Use permutations from itertools together with your conditions to filter for the matches:

from itertools import permutations
phase1 = [(ind, (first, second),*_) for ind, first, second, *_ in box]

#can be refactored with something cleaner
phase2 = [((*first[1],*first[2:]), second[0]) 
          for first, second in permutations(phase1,2) 
          if first[1] == second[1] and abs(second[2] - first[2]) <= 0.05 and first[-1] != second[-1]
         ]
phase2

[((1, 2, 1.0, 'a'), 2),
 ((1, 2, 1.0, 'a'), 5),
 ((1, 2, 1.01, 'a'), 2),
 ((1, 2, 1.01, 'a'), 5),
 ((1, 2, 1.0, 'b'), 0),
 ((1, 2, 1.0, 'b'), 1),
 ((1, 2, 1.0, 'b'), 5),
 ((3, 4, 0.0, 'b'), 4),
 ((3, 4, 0.0, 'c'), 3),
 ((1, 2, 1.0, 'c'), 0),
 ((1, 2, 1.0, 'c'), 1),
 ((1, 2, 1.0, 'c'), 2)]
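
The step that follows reads from a dictionary named d, but the code that builds it does not appear in the text. A minimal sketch, assuming d simply groups the matching indices in phase2 by their row tuple (the use of `defaultdict` is an assumption, not confirmed by the source):

```python
from collections import defaultdict

# phase2 exactly as printed above: (row values, index of a matching row)
phase2 = [
    ((1, 2, 1.0, 'a'), 2), ((1, 2, 1.0, 'a'), 5),
    ((1, 2, 1.01, 'a'), 2), ((1, 2, 1.01, 'a'), 5),
    ((1, 2, 1.0, 'b'), 0), ((1, 2, 1.0, 'b'), 1), ((1, 2, 1.0, 'b'), 5),
    ((3, 4, 0.0, 'b'), 4),
    ((3, 4, 0.0, 'c'), 3),
    ((1, 2, 1.0, 'c'), 0), ((1, 2, 1.0, 'c'), 1), ((1, 2, 1.0, 'c'), 2),
]

# Collect all matching indices under the row tuple they belong to
d = defaultdict(list)
for row, match_index in phase2:
    d[row].append(match_index)

print(dict(d))
```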
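Combine the values in d (a dictionary that maps each row tuple from phase2 to the list of its matching indices) into strings: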

e = [(*k,",".join(str(ent) for ent in v)) for k,v in d.items()]
e

[(1, 2, 1.0, 'a', '2,5'),
 (1, 2, 1.01, 'a', '2,5'),
 (1, 2, 1.0, 'b', '0,1,5'),
 (3, 4, 0.0, 'b', '4'),
 (3, 4, 0.0, 'c', '3'),
 (1, 2, 1.0, 'c', '0,1,2')]
Create a DataFrame from the extract:

cols = df.columns.append(pd.Index(["Dups"]))
dups = pd.DataFrame(e, columns=cols)
Merge with the original DataFrame:

result = df.merge(dups, how="left", on=["A", "B", "C", "D"])
result

    A   B   C       D   Dups
0   1   2   1.00    a   2,5
1   1   2   1.01    a   2,5
2   1   2   1.00    b   0,1,5
3   3   4   0.00    b   4
4   3   4   0.00    c   3
5   1   2   1.00    c   0,1,2
6   1   9   1.00    c   NaN

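My original answer requires N**2 iterations for N rows. sammywemmy's answer loops over permutations(..., 2), which is essentially a loop over N*(N-1) combinations. warped's answer is more efficient because it first does a faster match on columns A and B, but the search on the conditions for columns C and D is still slow. Its iteration count is therefore N*M, where M is the average number of rows that share the same A and B values.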
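
To make these counts concrete, here is a small sketch that computes them for the sample table. The variable names are illustrative, and the grouped count assumes each (A, B) group of size g costs g*g comparisons, as in a nested loop over the group.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 3, 3, 1, 1],
    'B': [2, 2, 2, 4, 4, 2, 9],
})

n = len(df)
full_double_loop = n ** 2         # every row against every row, including itself
permutation_pairs = n * (n - 1)   # itertools.permutations(rows, 2)

# Grouping on A and B first: only rows inside the same group are compared
group_sizes = df.groupby(['A', 'B']).size()
grouped_pairs = int((group_sizes ** 2).sum())

print(full_double_loop, permutation_pairs, grouped_pairs)  # 49 42 21
```

Here the groups have sizes 4, 2 and 1, so the grouped count is 16 + 4 + 1 = 21, which corresponds to N*M with N = 7 and M = 3.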

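If you are willing to change the requirement "C equal +/- 0.05" to "C equal when rounded to 1 decimal place", you can do better with N*K iterations, where K is the average number of rows with the same A, B and C values. Here is an implementation; you could also adapt warped's approach.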

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'A': {0: 1, 1: 1, 2: 1, 3: 3, 4: 3, 5: 1, 6: 1},
     'B': {0: 2, 1: 2, 2: 2, 3: 4, 4: 4, 5: 2, 6: 9},
     'C': {0: 1.0, 1: 1.01, 2: 1.0, 3: 0.0, 4: 0.0, 5: 1.0, 6: 1.0},
     'D': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c', 5: 'c', 6: 'c'}})

# alternative to "equal +/- 0.05"
df['C10'] = np.around(df['C']*10).astype('int')

# convert int64 tuples to int tuples
ituple = lambda tup: tuple(int(x) for x in tup)

# records: [(1, 2, 10), (1, 2, 10), (1, 2, 10), (3, 4, 0), ...]
records = [ituple(rec) for rec in df[['A', 'B', 'C10']].to_records(index=False)]

# dupd: dict with records as key, list of indices as values.
# e.g. {(1, 2, 10): [0, 1, 2, 5], ...}
dupd = {} # key: ABC tuples; value: list of indices

# Build up dupd based on equal A, B, C columns.
for i, rec in enumerate(records):
    # each record is a tuple with integers; can be used as key in dict
    if rec in dupd:
        dupd[rec].append(i)
    else:
        dupd[rec] = [i]
        
# build duplicates for each row, remove the ones with equal D
dups = []
D = df['D']
for i, rec in enumerate(records):
    dup = [j for j in dupd[rec] if i!=j and D[i] != D[j]]
    dups.append(tuple(dup))
    
df.drop(columns=['C10'], inplace=True)
df['Dups'] = dups
        
print(df)
Output:

   A  B     C  D       Dups
0  1  2  1.00  a     (2, 5)
1  1  2  1.01  a     (2, 5)
2  1  2  1.00  b  (0, 1, 5)
3  3  4  0.00  b       (4,)
4  3  4  0.00  c       (3,)
5  1  2  1.00  c  (0, 1, 2)
6  1  9  1.00  c         ()
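
Rounding C to one decimal place gives the same result as the +/- 0.05 tolerance on this sample, but the two rules are not equivalent in general. A small sketch of where they differ; `within_tolerance` and `same_bucket` are hypothetical helper names, and `same_bucket` mirrors the `np.around(C * 10)` step used above.

```python
import numpy as np

def within_tolerance(x, y, tol=0.05):
    # Original rule: absolute difference no larger than tol
    return abs(x - y) <= tol

def same_bucket(x, y):
    # Relaxed rule: equal after rounding to one decimal place
    return int(np.around(x * 10)) == int(np.around(y * 10))

# Close values on opposite sides of a rounding boundary
print(within_tolerance(1.04, 1.06), same_bucket(1.04, 1.06))  # True False

# Values 0.08 apart that still round to the same decimal
print(within_tolerance(0.96, 1.04), same_bucket(0.96, 1.04))  # False True
```

Whether this difference matters depends on how the C values are distributed in the real data.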
Below is the original answer, which scales as O(N**2) but is easy to understand:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'A': {0: 1, 1: 1, 2: 1, 3: 3, 4: 3, 5: 1, 6: 1},
     'B': {0: 2, 1: 2, 2: 2, 3: 4, 4: 4, 5: 2, 6: 9},
     'C': {0: 1.0, 1: 1.01, 2: 1.0, 3: 0.0, 4: 0.0, 5: 1.0, 6: 1.0},
     'D': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c', 5: 'c', 6: 'c'}})


dups = []
for i, irow in df.iterrows():
    dup = []
    for j, jrow in df.iterrows():
        if (i != j and 
            irow['A'] == jrow['A'] and
            irow['B'] == jrow['B'] and 
            abs(irow['C']-jrow['C']) < 0.05 and
            irow['D'] != jrow['D']
            ):
            dup.append(j)
    dups.append(tuple(dup))
df['Dups'] = dups

print(df)

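I don't mind the inefficiency. I just want to add a column to help with some research... The data is generated a few times a week, and it can take a while to generate. I'm not very familiar with pandas. Thanks!

Output of the original answer: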
   A  B     C  D       Dups
0  1  2  1.00  a     (2, 5)
1  1  2  1.01  a     (2, 5)
2  1  2  1.00  b  (0, 1, 5)
3  3  4  0.00  b       (4,)
4  3  4  0.00  c       (3,)
5  1  2  1.00  c  (0, 1, 2)
6  1  9  1.00  c         ()