Python: how to find the intersection of a pair of columns across multiple dataframes when the pairs can be in any order?


I have multiple dataframes; for simplicity, assume I have three:

   >> df1=
       col1  col2
   id1  A     B  
   id2  C     D  
   id3  B     A  
   id4  E     F  


    >> df2=
       col1  col2
   id1  B     A  
   id2  D     C  
   id3  M     N  
   id4  F     E  

    >> df3=
       col1  col2
   id1  A     B  
   id2  D     C  
   id3  N     M  
   id4  E     F  
The desired result is:

    >> df=
       col1  col2
   id1  A     B
   id2  C     D
   id3  E     F
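because the pairs (A,B), (C,D) and (E,F) appear in every dataframe, even though the order within a pair may be reversed.

pandas merge only matches the columns in the order they are passed. To check this observation, I tried the following code on two dataframes: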

df1['reverse_1'] = (df1.col1+df1.col2).isin(df2.col1 + df2.col2)

df1['reverse_2'] = (df1.col1+df1.col2).isin(df2.col2 + df2.col1)
I found that the two results differ:

col1    col2    reverse_1   reverse_2
 a        b       False      True
 c        d       False      True
 b        a       True       False
 e        f       False      True

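So if I collect the True values from the reverse_1 and reverse_2 columns, I can get the intersection of the two dataframes. But even if I do this for two dataframes, it is not clear to me how to handle more than two. I am a bit confused about this. Any suggestions?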
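The two-frame check described in the question can be sketched as follows. This is only an illustration of that idea, rebuilding df1 and df2 inline; the variable names forward, same_order and swapped_order are made up for this example. Note that concatenating the two strings is only safe for keys like these single letters, and that the result still contains both (A,B) and (B,A).

```python
import pandas as pd

idx = ["id1", "id2", "id3", "id4"]
df1 = pd.DataFrame({"col1": list("ACBE"), "col2": list("BDAF")}, index=idx)
df2 = pd.DataFrame({"col1": list("BDMF"), "col2": list("ACNE")}, index=idx)

# Pair from df1 written as one string, e.g. "AB"
forward = df1.col1 + df1.col2

# Does df2 contain the pair in the same order, or in swapped order?
same_order = forward.isin(df2.col1 + df2.col2)
swapped_order = forward.isin(df2.col2 + df2.col1)

# A row of df1 belongs to the two-frame intersection if either test passes
common = df1[same_order | swapped_order]
print(common)
```

This does not extend naturally to more than two frames, which is the gap the answers address.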

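You can create a list of the dataframes and, in a list comprehension, sort each one row-wise and remove duplicates:

import numpy as np
import pandas as pd

dfs = [df1, df2, df3]

# Sort each row so that (B, A) becomes (A, B), then drop repeated pairs within each frame
L = [pd.DataFrame(np.sort(x.values, axis=1), columns=x.columns).drop_duplicates()
     for x in dfs]
print(L)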
[  col1 col2
0    A    B
1    C    D
3    E    F,   col1 col2
0    A    B
1    C    D
2    M    N
3    E    F,   col1 col2
0    A    B
1    C    D
2    M    N
3    E    F]
Then merge the frames in the list on all columns (that is, without passing the on parameter).
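The code for this merge step does not appear in the text above. A minimal sketch, assuming the step folds pd.merge over the list with functools.reduce (pd.merge defaults to an inner join on all shared column names when on is omitted), could look like this; the test frames are rebuilt inline so the block runs on its own.

```python
from functools import reduce

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"col1": list("ACBE"), "col2": list("BDAF")})
df2 = pd.DataFrame({"col1": list("BDMF"), "col2": list("ACNE")})
df3 = pd.DataFrame({"col1": list("ADNE"), "col2": list("BCMF")})
dfs = [df1, df2, df3]

# Row-wise sort normalises each pair; drop_duplicates removes repeats per frame
L = [pd.DataFrame(np.sort(x.values, axis=1), columns=x.columns).drop_duplicates()
     for x in dfs]

# Inner-merge the normalised frames pairwise on all common columns (col1, col2)
df = reduce(lambda left, right: pd.merge(left, right), L)
print(df)
```

Because every frame was de-duplicated first, each merge is one-to-one on the pair, so the result holds each common pair once.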

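Another solution, from @pygo:

Create the index from frozensets and join the frames together with concat using an inner join. Finally, remove duplicates by index with duplicated and take the first two columns with iloc: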

df = pd.concat([x.set_index(x.apply(frozenset, axis=1)) for x in dfs], axis=1, join='inner')
df = df.iloc[~df.index.duplicated(), :2]
print (df)
       col1 col2
(B, A)    A    B
(C, D)    C    D
(F, E)    E    F
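The reason frozenset works as a key is that it ignores element order, so (A,B) and (B,A) compare equal. As an illustration of that property only, here is a sketch that swaps the concat step for a plain Python set intersection; it is not the code from the answer above, and pair_keys is a helper name invented for this example.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": list("ACBE"), "col2": list("BDAF")})
df2 = pd.DataFrame({"col1": list("BDMF"), "col2": list("ACNE")})
df3 = pd.DataFrame({"col1": list("ADNE"), "col2": list("BCMF")})
dfs = [df1, df2, df3]

# frozenset ignores element order, so both spellings of a pair compare equal
assert frozenset(("A", "B")) == frozenset(("B", "A"))

def pair_keys(df):
    """Hypothetical helper: one frozenset per row, built from col1 and col2."""
    return [frozenset(row) for row in df[["col1", "col2"]].to_numpy()]

# Pairs present in every frame, regardless of order
common = set.intersection(*(set(pair_keys(x)) for x in dfs))

# Keep the first row of df1 for each common pair
seen = set()
keep = []
for key in pair_keys(df1):
    keep.append(key in common and key not in seen)
    seen.add(key)
result = df1[keep]
print(result)
```

One caveat of any frozenset key: a row whose two values are identical collapses to a one-element set.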

Somewhat similar to some of the previous answers:

import pandas as pd
from io import StringIO

# Test data (sep=r"\s+" splits on any run of whitespace)
df1 = pd.read_csv(StringIO("""
id col1 col2
id1  A     B
id2  C     D
id3  B     A
id4  E     F
"""), sep=r"\s+")
df2 = pd.read_csv(StringIO("""
id col1 col2
id1  B     A
id2  D     C
id3  M     N
id4  F     E
"""), sep=r"\s+")
df3 = pd.read_csv(StringIO("""
id col1 col2
id1  A     B
id2  D     C
id3  N     M
id4  E     F
"""), sep=r"\s+")

# List of n dataframes
dfs = [df1, df2, df3]

# Use frozenset to represent each pair of column values without regard for order
# DataFrame.apply with axis=1 iterates over each row
# The list comprehension iterates over each dataframe
combined_columns = [pd.Series(df.apply(lambda r: frozenset((r.col1, r.col2)), axis=1), name='combined') for df in dfs]
print(combined_columns)
# Results in a list of Series named 'combined'
#[0    (B, A)
# 1    (D, C)
# 2    (B, A)
# 3    (F, E)
# Name: combined, dtype: object, 
# 0    (B, A)
# 1    (D, C)
# 2    (N, M)
# 3    (E, F)
# Name: combined, dtype: object, 
# 0    (B, A)
# 1    (D, C)
# 2    (M, N)
# 3    (F, E)
# Name: combined, dtype: object]

# Attach each 'combined' Series to its dataframe as an extra column
dfs_combined = [pd.concat([df, combined], axis=1) for df, combined in zip(dfs, combined_columns)]
print(dfs_combined)
# Results in a list of dataframes with the extra column
#[    id col1 col2 combined
# 0  id1    A    B   (B, A)
# 1  id2    C    D   (D, C)
# 2  id3    B    A   (B, A)
# 3  id4    E    F   (F, E),     
#     id col1 col2 combined
# 0  id1    B    A   (B, A)
# 1  id2    D    C   (D, C)
# 2  id3    M    N   (N, M)
# 3  id4    F    E   (E, F),
#     id col1 col2 combined
# 0  id1    A    B   (B, A)
# 1  id2    D    C   (D, C)
# 2  id3    N    M   (M, N)
# 3  id4    E    F   (F, E)]

# reduce works on pairs, passing the previous result as the first argument:
# keep only the rows of the running result whose pair also occurs in the next frame
from functools import reduce
result = reduce(lambda left, right: left[left['combined'].isin(right['combined'])], dfs_combined).drop_duplicates(subset='combined')
print(result)
#    id col1 col2 combined
#0  id1    A    B   (B, A)
#1  id2    C    D   (D, C)
#3  id4    E    F   (F, E)

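Are there only 2 columns in the dataframes?

There are 4 columns, but I need to compare these two columns and copy the rest of the data from the other columns.

Please look at the three dataframes [df1, df2, df3]. You will see that the pair (A,B) occurs in all of them, but it is (B,A) in df2. The same holds for (C,D) and (E,F). So I need to find the pairs of elements common to all dataframes, where the elements can occur in either order, (A,B) or (B,A).

@pygo This simply appends all the columns side by side. With axis=0 it stacks the column elements instead. Neither gives the desired result. I am using the answer given by "jezrael".

OK, I hope you got the solution from @jezrael's answer.

@jezrael Elegant is the only word for this solution. By the way, the activity and depth of knowledge you all show on this forum is an inspiration to me. Could you add some explanation of the first part of the code?

@pygo - I used frozensets ;) to create a solution for you.

@Ashutosh - Sure. Each row of the dataframe is sorted with np.sort, and the resulting numpy array is passed to the DataFrame constructor so that DataFrame.drop_duplicates() can be called on it. This is done in a list comprehension for each dataframe in the list of dataframes.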
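For the four-column case mentioned in the comments, one possible approach is to sort only the two key columns and carry the remaining columns along. This is a sketch under assumptions: the extra column score, the helper with_sorted_keys and the temporary columns k1 and k2 are all hypothetical names, and keeping the extra data from the first dataframe is a choice made here, not something the thread specifies.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Hypothetical frames: col1/col2 are the pair, 'score' stands in for extra data
df1 = pd.DataFrame({"col1": list("ACBE"), "col2": list("BDAF"), "score": [1, 2, 3, 4]})
df2 = pd.DataFrame({"col1": list("BDMF"), "col2": list("ACNE"), "score": [5, 6, 7, 8]})
df3 = pd.DataFrame({"col1": list("ADNE"), "col2": list("BCMF"), "score": [9, 10, 11, 12]})
dfs = [df1, df2, df3]

def with_sorted_keys(df, cols=("col1", "col2")):
    """Copy of df with k1/k2 holding each row's pair in sorted order."""
    keys = np.sort(df[list(cols)].to_numpy(), axis=1)  # sort within each row
    return df.assign(k1=keys[:, 0], k2=keys[:, 1])

keyed = [with_sorted_keys(x) for x in dfs]

# Pairs common to every frame: inner-merge the de-duplicated key columns
common = reduce(pd.merge, [k[["k1", "k2"]].drop_duplicates() for k in keyed])

# Keep the first matching row of the first frame, with its extra columns
result = (keyed[0].merge(common, on=["k1", "k2"])
                  .drop_duplicates(subset=["k1", "k2"])
                  .drop(columns=["k1", "k2"]))
print(result)
```

The original col1 and col2 values are left untouched; only the temporary key columns are sorted.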