Python: How to find the intersection of a pair of columns across multiple dataframes, with the pairs in any order?
I have multiple dataframes; for simplicity, assume I have three:
>> df1=
col1 col2
id1 A B
id2 C D
id3 B A
id4 E F
>> df2=
col1 col2
id1 B A
id2 D C
id3 M N
id4 F E
>> df3=
col1 col2
id1 A B
id2 D C
id3 N M
id4 E F
The desired result is:
>> df=
col1 col2
id1 A B
id2 C D
id3 E F
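because the pairs (A, B), (C, D), and (E, F) appear in all of the dataframes, although a pair may be reversed.
pandas merge only matches columns in the order they are passed. To check this observation, I tried the following code on two dataframes: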
df1['reverse_1'] = (df1.col1+df1.col2).isin(df2.col1 + df2.col2)
df1['reverse_2'] = (df1.col1+df1.col2).isin(df2.col2 + df2.col1)
I found that the two results differ:
col1 col2 reverse_1 reverse_2
a b False True
c d False True
b a True False
e f False True
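So if I collect the True values from the reverse_1 and reverse_2 columns, I can get the intersection of the two dataframes. Even so, it is not clear to me how to handle more than two dataframes. I am a bit confused about this. Any suggestions?
You can create a list of DataFrames, then in a list comprehension sort each DataFrame row-wise and remove duplicates: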
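import numpy as np
import pandas as pd

dfs = [df1, df2, df3]
# Sort the values within each row so (B, A) becomes (A, B), then drop repeated pairs
L = [pd.DataFrame(np.sort(x.values, axis=1), columns=x.columns).drop_duplicates()
     for x in dfs]
print(L)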
[ col1 col2
0 A B
1 C D
3 E F, col1 col2
0 A B
1 C D
2 M N
3 E F, col1 col2
0 A B
1 C D
2 M N
3 E F]
Then merge on all columns (that is, without passing the on parameter).
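The merge step described above can be sketched as follows. This is a minimal sketch, assuming functools.reduce is used to chain pd.merge over the list of normalized frames. The sample data is rebuilt from the question, and the names normalized and common are illustrative, not taken from the original answer.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Sample data rebuilt from the question (the ids become the index).
idx = ["id1", "id2", "id3", "id4"]
df1 = pd.DataFrame({"col1": list("ACBE"), "col2": list("BDAF")}, index=idx)
df2 = pd.DataFrame({"col1": list("BDMF"), "col2": list("ACNE")}, index=idx)
df3 = pd.DataFrame({"col1": list("ADNE"), "col2": list("BCMF")}, index=idx)
dfs = [df1, df2, df3]

# Sort each row so that (B, A) becomes (A, B), then drop repeated pairs.
normalized = [
    pd.DataFrame(np.sort(x.values, axis=1), columns=x.columns).drop_duplicates()
    for x in dfs
]

# With no `on` argument, pd.merge joins on all shared column names
# (col1 and col2 here); the default how="inner" keeps only common pairs.
common = reduce(lambda left, right: pd.merge(left, right), normalized)
print(common)
```

For the sample data this should leave the three pairs (A, B), (C, D), and (E, F). Because reduce folds the list pairwise, the same line works for any number of dataframes.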
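Another solution, for @pygo:
Create an index of frozensets and join the DataFrames together with concat using an inner join, then remove rows with a duplicated index using duplicated and boolean indexing, and select the first two columns with iloc: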
df = pd.concat([x.set_index(x.apply(frozenset, axis=1)) for x in dfs], axis=1, join='inner')
df = df.iloc[~df.index.duplicated(), :2]
print (df)
col1 col2
(B, A) A B
(C, D) C D
(F, E) E F
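The frozenset idea can also be written with plain set intersection, which avoids concatenating frames whose index contains repeated keys. This is a minimal sketch under that assumption, not the answer's own code. The sample data is rebuilt from the question, and pair_keys, common_keys, and result are hypothetical names.

```python
from functools import reduce

import pandas as pd

# Sample data rebuilt from the question (the ids become the index).
idx = ["id1", "id2", "id3", "id4"]
df1 = pd.DataFrame({"col1": list("ACBE"), "col2": list("BDAF")}, index=idx)
df2 = pd.DataFrame({"col1": list("BDMF"), "col2": list("ACNE")}, index=idx)
df3 = pd.DataFrame({"col1": list("ADNE"), "col2": list("BCMF")}, index=idx)
dfs = [df1, df2, df3]


def pair_keys(df):
    """Return one frozenset per row, so (A, B) and (B, A) compare equal."""
    return pd.Series(
        [frozenset(pair) for pair in zip(df["col1"], df["col2"])],
        index=df.index,
    )


# Intersect the sets of unordered pairs across all frames.
common_keys = reduce(set.intersection, (set(pair_keys(df)) for df in dfs))

# Keep the first row of df1 for each common pair, in df1's own orientation.
keys1 = pair_keys(df1)
is_common = keys1.map(lambda key: key in common_keys)
result = df1[is_common & ~keys1.duplicated()]
print(result)
```

Filtering df1 rather than building a new frame keeps its original index and any other columns it may have.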
This is somewhat similar to some of the earlier answers:
import pandas as pd
from io import StringIO
# Test data
df1 = pd.read_table(StringIO ("""
id col1 col2
id1 A B
id2 C D
id3 B A
id4 E F
"""), sep=r"\s+")
df2 = pd.read_table(StringIO ("""
id col1 col2
id1 B A
id2 D C
id3 M N
id4 F E
"""), sep=r"\s+")
df3 = pd.read_table(StringIO("""
id col1 col2
id1 A B
id2 D C
id3 N M
id4 E F
"""), sep=r"\s+")
# List of n dataframes
dfs = [df1, df2, df3]
# Use frozenset to define the column values without regard for order
# pandas apply iterates over each row
# list expression iterates over each dataframe
combined_columns = [pd.Series(df.apply(lambda r: frozenset((r.col1, r.col2)), axis=1), name = 'combined') for df in dfs]
print(combined_columns)
# Results in a list of Series named 'combined'
#[0 (B, A)
# 1 (D, C)
# 2 (B, A)
# 3 (F, E)
# Name: combined, dtype: object,
# 0 (B, A)
# 1 (D, C)
# 2 (N, M)
# 3 (E, F)
# Name: combined, dtype: object,
# 0 (B, A)
# 1 (D, C)
# 2 (M, N)
# 3 (F, E)
# Name: combined, dtype: object]
dfs_combined = [pd.concat([dfs[i], combined_columns[i]], axis = 1) for i in range(len(dfs))]
print(dfs_combined)
# Results in a list of dataframes with the extra column
#[ id col1 col2 combined
# 0 id1 A B (B, A)
# 1 id2 C D (D, C)
# 2 id3 B A (B, A)
# 3 id4 E F (F, E),
# id col1 col2 combined
# 0 id1 B A (B, A)
# 1 id2 D C (D, C)
# 2 id3 M N (N, M)
# 3 id4 F E (E, F),
# id col1 col2 combined
# 0 id1 A B (B, A)
# 1 id2 D C (D, C)
# 2 id3 N M (M, N)
# 3 id4 E F (F, E)]
# The reduce function operates on pairs, with previous result as the first argument
from functools import reduce
result = reduce(lambda df1, df2: df1[df1['combined'].isin(df2['combined'])], dfs_combined).drop_duplicates(subset='combined')
print(result)
# id col1 col2 combined
#0 id1 A B (B, A)
#1 id2 C D (D, C)
#3 id4 E F (F, E)
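Are there only 2 columns in the dataframes?
There are 4 columns, but I need to compare these two columns and copy the rest of the data from the other columns.
Please look at the three dataframes [df1, df2, df3]. You will see that the pair (A, B) appears in all of them, but it is (B, A) in df2. The same goes for (C, D) and (E, F). So I need to find the pairs of elements common to all dataframes, where the elements can appear in either order, (A, B) or (B, A).
@pygo This simply appends all the columns side by side. With axis=0 it stacks the column elements instead. But that does not give the desired result. I am using the answer given by "jezrael".
OK, I hope you got your solution from @jezrael's answer.
@jezrael Elegant is the only word for this solution. By the way, I am really inspired by how active you all are on this forum and by the depth of your knowledge. Could you add some explanation for the first part of the code?
@pygo - I created a solution for you with frozensets ;)
@Ashutosh - Sure. Each row of a DataFrame is sorted with np.sort, and a new DataFrame is built from the resulting numpy array so that DataFrame.drop_duplicates() can be called on it. The list comprehension applies this to each DataFrame in the list of DataFrames.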
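To make the row-sorting explanation concrete, and to cover the case raised in the comments where the frames carry more than two columns, here is a minimal sketch. It assumes only col1 and col2 form the pair; the score column and the names pair_cols, sorted_pairs, normalized, and deduped are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with an extra column besides the pair columns.
df = pd.DataFrame(
    {
        "col1": ["A", "C", "B", "E"],
        "col2": ["B", "D", "A", "F"],
        "score": [1, 2, 3, 4],  # invented column, not from the question
    }
)

pair_cols = ["col1", "col2"]

# np.sort with axis=1 orders the values inside each row independently,
# so (B, A) and (A, B) both end up as the row (A, B).
sorted_pairs = np.sort(df[pair_cols].to_numpy(), axis=1)

# Write the sorted pairs back and drop repeats, judging duplicates only
# by the pair columns so the remaining columns are carried along.
normalized = df.copy()
normalized[pair_cols] = sorted_pairs
deduped = normalized.drop_duplicates(subset=pair_cols)
print(deduped)
```

Passing subset to drop_duplicates keeps the first row seen for each pair, so which extra-column values survive depends on row order.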