Merging rows that share two common values | Python

python pandas dataframe

I have been struggling with what should be a simple merge between rows. I have two DataFrames with the following columns:

df_a.columns.to_list()
['id','food','color','type','shape']

df_b.columns.to_list()
['id','food','smell','date']

I want to check whether any food appears in both DataFrames, so that I can merge them into a single DataFrame:

df_total = pd.concat([df_a, df_b], keys=['A', 'B'], ignore_index=False)
df_total = df_total.sort_values(by=['food'], ascending=True)
df_total['food'].value_counts().loc[lambda x : x>=2]

Out[1]
apple       2
cheese      2

According to this, "apple" and "cheese" are duplicated. Printing the concatenated table gives:

id     food     color     type     shape     smell       date
-----------------------------------------------------------------
 1     apple     red      fruit    round      NaN         NaT
 1     apple     NaN       NaN      NaN      soft     2020-06-05
 2     cheese  yellow     dairy   squared     NaN         NaT
 2     cheese    NaN       NaN      NaN      soft     2020-06-07
 3     lemon    green     fruit    round      NaN         NaT

Expected output:

id     food     color     type     shape     smell       date
-----------------------------------------------------------------
 1     apple     red      fruit    round     soft     2020-06-05
 2     cheese  yellow     dairy   squared    soft     2020-06-07
 3     lemon    green     fruit    round      NaN         NaT

original_df = pd.DataFrame({'food':['apple','apple','cheese','cheese','lemon'],
                           'color':['red',np.nan,'yellow',np.nan,'green'],
                           'type':['fruit',np.nan,'dairy',np.nan,'fruit'],
                           'shape':['round',np.nan,'squared',np.nan,'round'],
                           'smell':[np.nan,'soft',np.nan,'soft',np.nan],
                           'date':[np.nan,'2020-06-05',np.nan,'2020-06-07',np.nan]})

My attempt:

This time I redefined df_total using pd.merge on the two DataFrames, after calling reset_index on each:

df_total = pd.merge(df_a.reset_index(), df_b.reset_index(), how='outer')  # also tried 'right', 'left' and 'inner'

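I tried 'right', 'left', 'outer' and 'inner' as the value of how, but the result looks as if one of the rows had simply been dropped, or it has no values at all. How can I get the desired output?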
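
One possible explanation for the result described above: when no on= argument is given, pd.merge joins on every column name the two frames share, and after reset_index() that includes the newly added index column, whose values need not line up between the frames. The following is a minimal sketch, assuming small hypothetical frames shaped like df_a and df_b (the real data is not shown in the question), that joins only on the two shared keys id and food:

```python
import pandas as pd

# Hypothetical sample frames shaped like df_a and df_b from the question.
df_a = pd.DataFrame({
    "id": [1, 2, 3],
    "food": ["apple", "cheese", "lemon"],
    "color": ["red", "yellow", "green"],
    "type": ["fruit", "dairy", "fruit"],
    "shape": ["round", "squared", "round"],
})
df_b = pd.DataFrame({
    "id": [1, 2],
    "food": ["apple", "cheese"],
    "smell": ["soft", "soft"],
    "date": pd.to_datetime(["2020-06-05", "2020-06-07"]),
})

# Join only on the two shared key columns. An outer join keeps foods
# that appear in just one of the frames (here: lemon).
df_total = pd.merge(df_a, df_b, on=["id", "food"], how="outer")
print(df_total)
```

With how="outer", a food that exists in only one frame keeps its row and gets NaN/NaT in the columns that come from the other frame.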
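Given the output you produced, and since the data you provided is incomplete, I would solve this with .drop_duplicates(), using its subset and keep parameters, after first applying bfill() to handle the missing values: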

desired_output = original_output.bfill().drop_duplicates('food',keep='first')
For example, starting from your undesired output:

original_df = pd.DataFrame({'food':['apple','apple','cheese','cheese','lemon'],
                           'color':['red',np.nan,'yellow',np.nan,'green'],
                           'type':['fruit',np.nan,'dairy',np.nan,'fruit'],
                           'shape':['round',np.nan,'squared',np.nan,'round'],
                           'smell':[np.nan,'soft',np.nan,'soft',np.nan],
                           'date':[np.nan,'2020-06-05',np.nan,'2020-06-07',np.nan]})
Using the following line:

desired_df = original_df.bfill().drop_duplicates('food',keep='first')
Outputs:

     food   color   type    shape smell        date
0   apple     red  fruit    round  soft  2020-06-05
2  cheese  yellow  dairy  squared  soft  2020-06-07
4   lemon   green  fruit    round   NaN         NaN

You can take advantage of groupby's first/last functionality. In this case:

df.groupby(['food']).last().reset_index()
Output:

        1  0       2      3        4     5           6
0   apple  1     red  fruit    round  soft  2020-06-05
1  cheese  2  yellow  dairy  squared  soft  2020-06-07
2   lemon  3   green  fruit    round   NaN         NaT
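
The groupby approach works because GroupBy.first() and GroupBy.last() return, for each column, the first (or last) non-null value within a group, so the two half-filled rows of a food collapse into one. Below is a minimal sketch with named columns; the shuffled row order is a hypothetical choice to illustrate that the duplicated rows do not need to be adjacent:

```python
import numpy as np
import pandas as pd

# The concatenated table from the question, with rows deliberately
# shuffled (hypothetical ordering) so duplicates are not adjacent.
df = pd.DataFrame({
    "id": [2, 1, 3, 1, 2],
    "food": ["cheese", "apple", "lemon", "apple", "cheese"],
    "color": [np.nan, "red", "green", np.nan, "yellow"],
    "type": [np.nan, "fruit", "fruit", np.nan, "dairy"],
    "shape": [np.nan, "round", "round", np.nan, "squared"],
    "smell": ["soft", np.nan, np.nan, "soft", np.nan],
    "date": pd.to_datetime(["2020-06-07", None, None, "2020-06-05", None]),
})

# first() picks, per column, the first non-null value in each group.
merged = df.groupby(["id", "food"], as_index=False).first()
print(merged)
```

Grouping on both id and food keeps the two key columns in the result and avoids merging rows that happen to share a food name but have different ids.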

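Please add more data, because the desired output does not seem reproducible from what you have provided.
@Celiusstinger There is a lot more data, around 60 rows. I only showed a few rows to keep it simple. All duplicated rows follow the same pattern: one row has "color", "type" and "shape" but is missing "smell" and "date", and the other row is the opposite.
Got it, maybe my answer can help you solve this.
groupby + last :) +1
Would this also work if the rows with duplicated values are not next to each other?
Once grouped, they are together. However, you may want to sort first to make sure they are in the order you prefer.
If I understand correctly, .drop_duplicates(keep='first') takes the first row of the duplicated values and appends the values of the other duplicated rows, right? Also, does this make the alphabetical ordering useless?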
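drop_duplicates(keep='first') only drops the rows whose food value is duplicated, keeping just the first one. By calling bfill() beforehand, we replace the NaN values with the values that follow them, which is why the rows need to be sorted alphabetically by food. Otherwise we would need to use groupby first and then bfill(), which is a valid and consistent alternative.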
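
The groupby-then-bfill() alternative mentioned above could look like the following minimal sketch. The data is hypothetical and chosen so that "lemon" has no second row and is followed by "pear"; in that situation a plain bfill() copies pear's smell into lemon's row, while a per-group back-fill does not:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "lemon" has a single row and is followed by "pear".
df = pd.DataFrame({
    "food": ["apple", "apple", "lemon", "pear", "pear"],
    "color": ["red", np.nan, "green", "green", np.nan],
    "smell": [np.nan, "soft", np.nan, np.nan, "sweet"],
})

# Plain bfill() fills across food boundaries: lemon inherits pear's smell.
leaky = df.bfill().drop_duplicates(subset="food", keep="first")

# Back-fill within each food only, then keep the first row per food.
value_cols = ["color", "smell"]
filled = df.copy()
filled[value_cols] = df.groupby("food")[value_cols].bfill()
safe = filled.drop_duplicates(subset="food", keep="first")

print(leaky)
print(safe)
```

Because the back-fill is restricted to each group, this variant does not depend on the rows being sorted or on every food having a matching second row.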