Python pandas: add an indicator for duplicates on columns
Here is a table with columns A, B, C, D:
A B C D
0 1 2 1.0 a
1 1 2 1.01 a
2 1 2 1.0 b
3 3 4 0 b
4 3 4 0 c
5 1 2 1 c
6 1 9 1 c
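How can I add a column that lists, for each row, the other rows that are duplicates of it under these constraints:
- exact match on A and B
- C matches within a float tolerance (within 0.05)
- D must not match
This is far from pretty, but it does get the job done: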
tolerance = 0.05
dups = {}
# Only rows that share the same A and B can be duplicates of each other.
for _, group in df.groupby(['A', 'B']):
    for i, row1 in group.iterrows():
        data = []
        for j, row2 in group.iterrows():
            if i != j:
                if abs(row1['C'] - row2['C']) <= tolerance:
                    if row1['D'] != row2['D']:
                        print(i, j)
                        data.append(j)
        dups[i] = data
# Reorder by row label; this assumes the default 0..N-1 index.
dups = [dups.get(a) for a in range(len(dups.keys()))]
df['dups'] = dups
df
A B C D dups
0 1 2 1.00 a [2, 5]
1 1 2 1.01 a [2, 5]
2 1 2 1.00 b [0, 1, 5]
3 3 4 0.00 b [4]
4 3 4 0.00 c [3]
5 1 2 1.00 c [0, 1, 2]
6 1 9 1.00 c []
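The grouping on A and B above can also be written as a self-merge followed by a boolean filter. The following is only a sketch under assumptions: the helper column name `i` and the `_x`/`_y` suffixes are illustrative choices, not part of the original answer, and the merge materialises every candidate pair, so it trades memory for fewer Python-level loops.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 3, 3, 1, 1],
    "B": [2, 2, 2, 4, 4, 2, 9],
    "C": [1.0, 1.01, 1.0, 0.0, 0.0, 1.0, 1.0],
    "D": ["a", "a", "b", "b", "c", "c", "c"],
})
tolerance = 0.05

# Keep the original row label as a column ("i", a hypothetical name)
# so it survives the merge.
left = df.reset_index().rename(columns={"index": "i"})

# Every pair of rows that agree exactly on A and B.
pairs = left.merge(left, on=["A", "B"], suffixes=("_x", "_y"))

# Apply the remaining constraints to each candidate pair.
mask = (
    (pairs["i_x"] != pairs["i_y"])
    & ((pairs["C_x"] - pairs["C_y"]).abs() <= tolerance)
    & (pairs["D_x"] != pairs["D_y"])
)
matches = pairs[mask].groupby("i_x")["i_y"].apply(list)

# Rows without any match get an empty list.
df["dups"] = [matches.get(i, []) for i in df.index]
print(df)
```

For the sample table this should give the same lists as the loop version.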
Convert the DataFrame to a dictionary:
res = df.T.to_dict("list")
res
{0: [1, 2, 1.0, 'a'],
1: [1, 2, 1.01, 'a'],
2: [1, 2, 1.0, 'b'],
3: [3, 4, 0.0, 'b'],
4: [3, 4, 0.0, 'c'],
5: [1, 2, 1.0, 'c'],
6: [1, 9, 1.0, 'c']}
Pair each index with the values of its sub-list:
box = [(key,*value) for key, value in res.items()]
box
[(0, 1, 2, 1.0, 'a'),
(1, 1, 2, 1.01, 'a'),
(2, 1, 2, 1.0, 'b'),
(3, 3, 4, 0.0, 'b'),
(4, 3, 4, 0.0, 'c'),
(5, 1, 2, 1.0, 'c'),
(6, 1, 9, 1.0, 'c')]
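As a side note, the same list of `(index, A, B, C, D)` tuples can probably be obtained in one step with `DataFrame.itertuples`, without going through the transposed dictionary. This is an alternative sketch, not what the answer above does:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 3, 3, 1, 1],
    "B": [2, 2, 2, 4, 4, 2, 9],
    "C": [1.0, 1.01, 1.0, 0.0, 0.0, 1.0, 1.0],
    "D": ["a", "a", "b", "b", "c", "c", "c"],
})

# name=None yields plain tuples; index=True puts the row label first.
box = list(df.itertuples(index=True, name=None))
print(box)
```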
Use permutations together with your conditions to filter out the matches:
from itertools import permutations
# Reshape each row to (index, (A, B), C, D) so A and B compare as one unit.
phase1 = [(ind, (first, second), *_) for ind, first, second, *_ in box]
# can be refactored with something cleaner
# abs() makes the tolerance check symmetric in the two rows
phase2 = [((*first[1], *first[2:]), second[0])
          for first, second in permutations(phase1, 2)
          if first[1] == second[1]
          and abs(second[2] - first[2]) <= 0.05
          and first[-1] != second[-1]
          ]
phase2
[((1, 2, 1.0, 'a'), 2),
((1, 2, 1.0, 'a'), 5),
((1, 2, 1.01, 'a'), 2),
((1, 2, 1.01, 'a'), 5),
((1, 2, 1.0, 'b'), 0),
((1, 2, 1.0, 'b'), 1),
((1, 2, 1.0, 'b'), 5),
((3, 4, 0.0, 'b'), 4),
((3, 4, 0.0, 'c'), 3),
((1, 2, 1.0, 'c'), 0),
((1, 2, 1.0, 'c'), 1),
((1, 2, 1.0, 'c'), 2)]
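The next step uses a dictionary `d` whose construction is not shown here. Judging from how it is consumed (`d.items()` yields a row tuple and a collection of indices), it presumably groups `phase2` by row tuple. A minimal sketch under that assumption, using a shortened `phase2` for illustration:

```python
from collections import defaultdict

# A few (row_values, matching_index) pairs in the shape produced by phase2.
phase2 = [
    ((1, 2, 1.0, "a"), 2),
    ((1, 2, 1.0, "a"), 5),
    ((3, 4, 0.0, "b"), 4),
]

# Assumption: d collects, for each row tuple, the indices of its matches.
d = defaultdict(list)
for row_values, match_index in phase2:
    d[row_values].append(match_index)

print(dict(d))
```

Note that keying on the row values means two rows with identical A, B, C and D would share one entry.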
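Group phase2 into a dictionary d that maps each row tuple to the list of its matching indices, then combine the values in d into strings: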
e = [(*k,",".join(str(ent) for ent in v)) for k,v in d.items()]
e
[(1, 2, 1.0, 'a', '2,5'),
(1, 2, 1.01, 'a', '2,5'),
(1, 2, 1.0, 'b', '0,1,5'),
(3, 4, 0.0, 'b', '4'),
(3, 4, 0.0, 'c', '3'),
(1, 2, 1.0, 'c', '0,1,2')]
Create a DataFrame from the extract:
cols = df.columns.append(pd.Index(["Dups"]))
dups = pd.DataFrame(e, columns=cols)
Merge with the original DataFrame:
result = df.merge(dups, how="left", on=["A", "B", "C", "D"])
result
A B C D Dups
0 1 2 1.00 a 2,5
1 1 2 1.01 a 2,5
2 1 2 1.00 b 0,1,5
3 3 4 0.00 b 4
4 3 4 0.00 c 3
5 1 2 1.00 c 0,1,2
6 1 9 1.00 c NaN
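My original answer requires N**2 iterations for N rows. sammywemmy's answer loops over permutations(…, 2), which is essentially a loop over N*(N-1) combinations. warped's answer is more efficient because it first does a faster match on columns A and B, but the search for the conditions on columns C and D is still slow. Its iteration count is therefore N*M, where M is the average number of rows that share the same A and B values.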
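To make those counts concrete, here is a rough sketch that computes them for the sample table. The variable names are illustrative, and the "grouped" count simply sums the squared group sizes, which equals N*M when M is averaged per row.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 3, 3, 1, 1],
    "B": [2, 2, 2, 4, 4, 2, 9],
    "C": [1.0, 1.01, 1.0, 0.0, 0.0, 1.0, 1.0],
    "D": ["a", "a", "b", "b", "c", "c", "c"],
})

n = len(df)
full = n * n          # nested loop over all rows
perms = n * (n - 1)   # permutations of two distinct rows

# Nested loop inside each (A, B) group: sum of squared group sizes.
sizes = df.groupby(["A", "B"]).size()
grouped = int((sizes ** 2).sum())

print(full, perms, grouped)
```

With 7 rows and groups of size 4, 2 and 1, the grouped loop does 21 comparisons against 49 for the full nested loop.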
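If you are willing to change the requirement "C equal +/- 0.05" to "C equal when rounded to 1 decimal place", you can do better with N*K iterations, where K is the average number of rows with the same A, B and C values. Here is an implementation; you could also adapt warped's approach.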
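Before relaxing the requirement, it may help to see that the two criteria are not equivalent. The helper names below are hypothetical and the values are picked only to illustrate the difference:

```python
import numpy as np

def same_bucket(x, y):
    """True when x and y round to the same value at one decimal place."""
    return bool(np.around(x * 10) == np.around(y * 10))

def within_tolerance(x, y, tol=0.05):
    """True when x and y differ by at most tol."""
    return abs(x - y) <= tol

# Close values that land in different rounding buckets.
print(within_tolerance(1.04, 1.06), same_bucket(1.04, 1.06))

# Values farther apart than the tolerance that share a bucket.
print(within_tolerance(0.96, 1.04), same_bucket(0.96, 1.04))
```

So rounding can both miss pairs that the tolerance would accept and accept pairs up to almost 0.1 apart; whether that matters depends on the data.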
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'A': {0: 1, 1: 1, 2: 1, 3: 3, 4: 3, 5: 1, 6: 1},
     'B': {0: 2, 1: 2, 2: 2, 3: 4, 4: 4, 5: 2, 6: 9},
     'C': {0: 1.0, 1: 1.01, 2: 1.0, 3: 0.0, 4: 0.0, 5: 1.0, 6: 1.0},
     'D': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c', 5: 'c', 6: 'c'}})

# alternative to "equal +/- 0.05": C scaled by 10 and rounded to an integer
df['C10'] = np.around(df['C']*10).astype('int')

# convert int64 tuples to int tuples
ituple = lambda tup: tuple(int(x) for x in tup)

# records: [(1, 2, 10), (1, 2, 10), (1, 2, 10), (3, 4, 0), ...]
records = [ituple(rec) for rec in df[['A', 'B', 'C10']].to_records(index=False)]

# dupd: dict with records as key, list of indices as values.
# e.g. {(1, 2, 10): [0, 1, 2, 5], ...}
dupd = {}  # key: ABC tuples; value: list of indices

# Build up dupd based on equal A, B, C columns.
for i, rec in enumerate(records):
    # each record is a tuple with integers; can be used as key in dict
    if rec in dupd:
        dupd[rec].append(i)
    else:
        dupd[rec] = [i]

# build duplicates for each row, remove the ones with equal D
dups = []
D = df['D']
for i, rec in enumerate(records):
    dup = [j for j in dupd[rec] if i != j and D[i] != D[j]]
    dups.append(tuple(dup))

df.drop(columns=['C10'], inplace=True)
df['Dups'] = dups
print(df)
Output:
A B C D Dups
0 1 2 1.00 a (2, 5)
1 1 2 1.01 a (2, 5)
2 1 2 1.00 b (0, 1, 5)
3 3 4 0.00 b (4,)
4 3 4 0.00 c (3,)
5 1 2 1.00 c (0, 1, 2)
6 1 9 1.00 c ()
Below is the original answer, which scales as O(N**2) but is easy to understand:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'A': {0: 1, 1: 1, 2: 1, 3: 3, 4: 3, 5: 1, 6: 1},
     'B': {0: 2, 1: 2, 2: 2, 3: 4, 4: 4, 5: 2, 6: 9},
     'C': {0: 1.0, 1: 1.01, 2: 1.0, 3: 0.0, 4: 0.0, 5: 1.0, 6: 1.0},
     'D': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c', 5: 'c', 6: 'c'}})

dups = []
for i, irow in df.iterrows():
    dup = []
    for j, jrow in df.iterrows():
        if (i != j and
            irow['A'] == jrow['A'] and
            irow['B'] == jrow['B'] and
            abs(irow['C']-jrow['C']) < 0.05 and
            irow['D'] != jrow['D']
        ):
            dup.append(j)
    dups.append(tuple(dup))
df['Dups'] = dups
print(df)
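The same all-pairs comparison can be expressed with NumPy broadcasting instead of two `iterrows()` loops. This is still O(N**2) in time and memory, so it is only a sketch for small to medium tables; it also reports row positions, which coincide with the labels only for the default 0..N-1 index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 3, 3, 1, 1],
    "B": [2, 2, 2, 4, 4, 2, 9],
    "C": [1.0, 1.01, 1.0, 0.0, 0.0, 1.0, 1.0],
    "D": ["a", "a", "b", "b", "c", "c", "c"],
})

a = df["A"].to_numpy()
b = df["B"].to_numpy()
c = df["C"].to_numpy()
d = df["D"].to_numpy(dtype=object)

# x[:, None] compared with x gives an N x N matrix of row-pair results.
mask = (
    (a[:, None] == a)
    & (b[:, None] == b)
    & (np.abs(c[:, None] - c) < 0.05)
    & (d[:, None] != d)
)
# A row is never a duplicate of itself.
np.fill_diagonal(mask, False)

# For each row, collect the positions of the rows that matched.
df["Dups"] = [tuple(np.flatnonzero(row).tolist()) for row in mask]
print(df)
```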
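I don't mind the inefficiency; I just want to add a column to help with some research... The data is generated a few times a week and can take a while to generate. I'm not very familiar with pandas. Thanks!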
A B C D Dups
0 1 2 1.00 a (2, 5)
1 1 2 1.01 a (2, 5)
2 1 2 1.00 b (0, 1, 5)
3 3 4 0.00 b (4,)
4 3 4 0.00 c (3,)
5 1 2 1.00 c (0, 1, 2)
6 1 9 1.00 c ()