Python 如何找到多个数据帧中一对列与任意顺序的对的交点？_Python_Python 3.x_Pandas_Dataframe

Python 如何找到多个数据帧中一对列与任意顺序的对的交点？

python python-3.x pandas dataframe

Python 如何找到多个数据帧中一对列与任意顺序的对的交点？,python,python-3.x,pandas,dataframe,Python,Python 3.x,Pandas,Dataframe,我有多个数据帧，为了简单起见，假设我有三个 >> df1= col1 col2 id1 A B id2 C D id3 B A id4 E F >> df2= col1 col2 id1 B A id2 D C id3 M N id4 F E >&g

我有多个数据帧，为了简单起见，假设我有三个

   >> df1=
       col1  col2
   id1  A     B  
   id2  C     D  
   id3  B     A  
   id4  E     F  


    >> df2=
       col1  col2
   id1  B     A  
   id2  D     C  
   id3  M     N  
   id4  F     E  

    >> df3=
       col1  col2
   id1  A     B  
   id2  D     C  
   id3  N     M  
   id4  E     F

所需的结果是：

    >> df=
       col1  col2
   id1  A     B
   id2  C     D
   id3  E     F

因为（A，B），（C，D），（E，F）对出现在所有数据帧中，尽管它可以反转

在使用pandas merge时，它只考虑传递列的方式。为了检查我的观察结果，我对两个数据帧尝试了以下代码：

df1['reverse_1'] = (df1.col1+df1.col2).isin(df2.col1 + df2.col2)

df1['reverse_2'] = (df1.col1+df1.col2).isin(df2.col2 + df2.col1)

我发现结果不同：

col1    col2    reverse_1   reverse_2
 a        b       False      True
 c        d       False      True
 b        a       True       False
 e        f       False      True

因此，如果我从reverse_1和reverse_2列中收集“True”值，我可以得到两个数据帧的交集。即使我对两个数据帧这样做，我也不清楚如何处理更多的数据帧（多于两个）。对此我有点困惑。有什么建议吗？

您可以创建

数据框的列表，并按行进行列表理解排序，删除重复项：
dfs = [df1,df2,df3]

L = [pd.DataFrame(np.sort(x.values, axis=1), columns=x.columns).drop_duplicates() 
     for x in dfs]
print (L)
[  col1 col2
0    A    B
1    C    D
3    E    F,   col1 col2
0    A    B
1    C    D
2    M    N
3    E    F,   col1 col2
0    A    B
1    C    D
2    M    N
3    E    F]

然后通过所有列（在

上没有参数

）：
@pygo的另一个解决方案：
创建index
byfrozenset
s并通过与internal
join连接在一起，最后通过索引删除重复项，通过和获取前两列：
df = pd.concat([x.set_index(x.apply(frozenset, axis=1)) for x in dfs], axis=1, join='inner')
df = df.iloc[~df.index.duplicated(), :2]
print (df)
       col1 col2
(B, A)    A    B
(C, D)    C    D
(F, E)    E    F

与前面的一些答案有些相似
import pandas as pd
from io import StringIO 

# Test data
df1 = pd.read_table(StringIO ("""
id col1 col2
id1  A     B
id2  C     D
id3  B     A
id4  E     F
"""), delim_whitespace = True)
df2 = pd.read_table(StringIO ("""
id col1 col2
id1  B     A  
id2  D     C  
id3  M     N  
id4  F     E  
"""), delim_whitespace = True)
df3 = pd.read_table(StringIO("""
id col1 col2
id1  A     B  
id2  D     C  
id3  N     M  
id4  E     F 
"""), delim_whitespace = True)

# List of n dataframes
dfs = [df1, df2, df3]

# Use frozenset to define the column values without regard for order 
# pandas apply iterates over each row
# list expression iterates over each dataframe
combined_columns = [pd.Series(df.apply(lambda r: frozenset((r.col1, r.col2)), axis=1), name = 'combined') for df in dfs]
print(combined_columns)
# Results in  alist of Series named 'combined'
#[0    (B, A)
# 1    (D, C)
# 2    (B, A)
# 3    (F, E)
# Name: combined, dtype: object, 
# 0    (B, A)
# 1    (D, C)
# 2    (N, M)
# 3    (E, F)
# Name: combined, dtype: object, 
# 0    (B, A)
# 1    (D, C)
# 2    (M, N)
# 3    (F, E)
# Name: combined, dtype: object]

dfs_combined = [pd.concat([dfs[i], combined_columns[i]], axis = 1) for i in range(len(dfs))]
print(dfs_combined)
# Result in a list of dataframes with the extra columns
#[    id col1 col2 combined
# 0  id1    A    B   (B, A)
# 1  id2    C    D   (D, C)
# 2  id3    B    A   (B, A)
# 3  id4    E    F   (F, E),     
#     id col1 col2 combined
# 0  id1    B    A   (B, A)
# 1  id2    D    C   (D, C)
# 2  id3    M    N   (N, M)
# 3  id4    F    E   (E, F),
#     id col1 col2 combined
# 0  id1    A    B   (B, A)
# 1  id2    D    C   (D, C)
# 2  id3    N    M   (M, N)
# 3  id4    E    F   (F, E)]

# The reduce function operates on pairs, with previous result as the first argument 
from functools import reduce
result = reduce(lambda df1, df2: df1[df1['combined'].isin(df2['combined'])], dfs_combined).drop_duplicates(subset='combined')
print(result)
#    id col1 col2 combined
#0  id1    A    B   (B, A)
#1  id2    C    D   (D, C)
#3  id4    E    F   (F, E)

数据帧中只有2列？有4列，但我需要比较这两列并从其他列复制其余数据。请查看三个数据帧[df1，df2，df3]。您将看到这对（A，B）出现在所有这些中。但它是df2中的（B，A）。对（C，D）和（E，F）也是如此。所以我需要在所有数据帧中找到元素的公共对，元素可以以任何顺序出现，（A，B）或（B，A）@pygo这将简单地并排附加所有列。如果axis=0，则它将堆叠列元素。但这并没有达到预期的效果。我正在使用“jezrael”给出的答案，好的，希望您能从@jezrael's获得解决方案answer@jezrael优雅是这个解决方案的唯一词汇。顺便说一句，你们在这个论坛上的积极性和知识的深度让我深受鼓舞。您能对代码的第一部分添加一些解释吗？@pygo-我用frozenset
s；）为您创建解决方案@Ashutosh-当然，您可以按np对数据帧的每一行进行排序。排序
，并从numpy数组中为可能的调用函数DataFrame.drop_duplicates（）。此解决方案是对数据帧列表中的每个数据帧进行列表调用理解。
import pandas as pd
from io import StringIO 

# Test data
df1 = pd.read_table(StringIO ("""
id col1 col2
id1  A     B
id2  C     D
id3  B     A
id4  E     F
"""), delim_whitespace = True)
df2 = pd.read_table(StringIO ("""
id col1 col2
id1  B     A  
id2  D     C  
id3  M     N  
id4  F     E  
"""), delim_whitespace = True)
df3 = pd.read_table(StringIO("""
id col1 col2
id1  A     B  
id2  D     C  
id3  N     M  
id4  E     F 
"""), delim_whitespace = True)

# List of n dataframes
dfs = [df1, df2, df3]

# Use frozenset to define the column values without regard for order 
# pandas apply iterates over each row
# list expression iterates over each dataframe
combined_columns = [pd.Series(df.apply(lambda r: frozenset((r.col1, r.col2)), axis=1), name = 'combined') for df in dfs]
print(combined_columns)
# Results in  alist of Series named 'combined'
#[0    (B, A)
# 1    (D, C)
# 2    (B, A)
# 3    (F, E)
# Name: combined, dtype: object, 
# 0    (B, A)
# 1    (D, C)
# 2    (N, M)
# 3    (E, F)
# Name: combined, dtype: object, 
# 0    (B, A)
# 1    (D, C)
# 2    (M, N)
# 3    (F, E)
# Name: combined, dtype: object]

dfs_combined = [pd.concat([dfs[i], combined_columns[i]], axis = 1) for i in range(len(dfs))]
print(dfs_combined)
# Result in a list of dataframes with the extra columns
#[    id col1 col2 combined
# 0  id1    A    B   (B, A)
# 1  id2    C    D   (D, C)
# 2  id3    B    A   (B, A)
# 3  id4    E    F   (F, E),     
#     id col1 col2 combined
# 0  id1    B    A   (B, A)
# 1  id2    D    C   (D, C)
# 2  id3    M    N   (N, M)
# 3  id4    F    E   (E, F),
#     id col1 col2 combined
# 0  id1    A    B   (B, A)
# 1  id2    D    C   (D, C)
# 2  id3    N    M   (M, N)
# 3  id4    E    F   (F, E)]

# The reduce function operates on pairs, with previous result as the first argument 
from functools import reduce
result = reduce(lambda df1, df2: df1[df1['combined'].isin(df2['combined'])], dfs_combined).drop_duplicates(subset='combined')
print(result)
#    id col1 col2 combined
#0  id1    A    B   (B, A)
#1  id2    C    D   (D, C)
#3  id4    E    F   (F, E)